Preface


We have made substantial changes in this edition of Introduction to Mathematical 
Statistics. Some of these changes help students appreciate the connection between 
statistical theory and statistical practice, while other changes enhance the development
and discussion of the statistical theory presented in this book.

Many of the changes in this edition reflect comments made by our readers. One 
of these comments concerned the small number of real data sets in the previous 
editions. In this edition, we have included more real data sets, using them to 
illustrate statistical methods or to compare methods. Further, we have made these 
data sets accessible to students by including them in the free R package hmcpkg. 
They can also be individually downloaded in an R session at the url listed below. 
In general, the R code for the analyses on these data sets is given in the text. 

We have also expanded the use of the statistical software R. We selected R 
because it is a powerful statistical language that is free and runs on all three main 
platforms (Windows, Mac, and Linux). Instructors, though, can select another 
statistical package. We have also expanded our use of R functions to compute 
analyses and simulation studies, including several games. We have kept the level of 
coding for these functions straightforward. Our goal is to show students that with 
a few simple lines of code they can perform significant computations. Appendix B 
contains a brief R primer, which suffices for the understanding of the R used in the 
text. As with the data sets, these R functions can be sourced individually at the 
cited url; however, they are also included in the package hmcpkg. 

We have supplemented the mathematical review material in Appendix A, placing 
it in the document Mathematical Primer for Introduction to Mathematical Statistics. 
It is freely available for students to download at the listed url. Besides sequences, 
this supplement reviews the topics of infinite series, differentiation, and integration
(univariate and bivariate). We have also expanded the discussion of iterated
integrals in the text and have added figures to clarify the discussion.

We have retained the order of elementary statistical inferences (Chapter 4) and 
asymptotic theory (Chapter 5). In Chapters 5 and 6, we have written brief reviews 
of the material in Chapter 4, so that Chapters 4 and 5 are essentially independent 
of one another and, hence, can be interchanged. In Chapter 3, we now begin the 
section on the multivariate normal distribution with a subsection on the bivariate 
normal distribution. Several important topics have been added. These include
Tukey’s multiple comparison procedure in Chapter 9 and confidence intervals for
the correlation coefficients found in Chapters 9 and 10. Chapter 7 now contains a
discussion on standard errors for estimates obtained by bootstrapping the sample.
Several topics that were discussed in the Exercises are now discussed in the text. 
Examples include quantiles, Section 1.7.1, and hazard functions, Section 3.3. In 
general, we have made more use of subsections to break up some of the discussion. 
Also, several more sections are now indicated by * as being optional. 


Content and Course Planning 


Chapters 1 and 2 develop probability models for univariate and multivariate
variables, while Chapter 3 discusses many of the most widely used probability models.
Chapter 4 discusses statistical theory for much of the inference found in a standard
statistical methods course. Chapter 5 presents asymptotic theory, concluding
with the Central Limit Theorem. Chapter 6 provides a complete inference (estimation
and testing) based on maximum likelihood theory. The EM algorithm is
also discussed. Chapters 7-8 contain optimal estimation procedures and tests of
statistical hypotheses. The final three chapters provide theory for three important
topics in statistics. Chapter 9 contains inference for normal theory methods for
basic analysis of variance, univariate regression, and correlation models. Chapter
10 presents nonparametric methods (estimation and testing) for location and univariate
regression models. It also includes discussion on the robust concepts of
efficiency, influence, and breakdown. Chapter 11 offers an introduction to Bayesian
methods. This includes traditional Bayesian procedures as well as Markov Chain
Monte Carlo techniques.

Several courses can be designed using our book. The basic two-semester course 
in mathematical statistics covers most of the material in Chapters 1-8 with topics 
selected from the remaining chapters. For such a course, the instructor would have 
the option of interchanging the order of Chapters 4 and 5, thus beginning the second 
semester with an introduction to statistical theory (Chapter 4). A one-semester 
course could consist of Chapters 1-4 with a selection of topics from Chapter 5. 
Under this option, the student sees much of the statistical theory for the methods 
discussed in a non-theoretical course in methods. On the other hand, as with the 
two-semester sequence, after covering Chapters 1-3, the instructor can elect to cover 
Chapter 5 and finish the course with a selection of topics from Chapter 4. 

The data sets and R functions used in this book and the R package hmcpkg can
be downloaded at the site:
https://media.pearsoncmg.com/cmg/pmmg_mml_shared/mathstatsresources/home/index.html
Chapter 1


Probability and Distributions 


1.1 Introduction 


In this section, we intuitively discuss the concept of a probability model, which we
formalize in Section 1.3. Many kinds of investigations may be characterized in part
by the fact that repeated experimentation, under essentially the same conditions, 
is more or less standard procedure. For instance, in medical research, interest may 
center on the effect of a drug that is to be administered; or an economist may be 
concerned with the prices of three specified commodities at various time intervals; or 
an agronomist may wish to study the effect that a chemical fertilizer has on the yield 
of a cereal grain. The only way in which an investigator can elicit information about 
any such phenomenon is to perform the experiment. Each experiment terminates 
with an outcome. But it is characteristic of these experiments that the outcome 
cannot be predicted with certainty prior to the experiment. 

Suppose that we have such an experiment, but the experiment is of such a nature 
that a collection of every possible outcome can be described prior to its performance. 
If this kind of experiment can be repeated under the same conditions, it is called 
a random experiment, and the collection of every possible outcome is called the 
experimental space or the sample space. We denote the sample space by C. 


Example 1.1.1. In the toss of a coin, let the outcome tails be denoted by T and let 
the outcome heads be denoted by H. If we assume that the coin may be repeatedly 
tossed under the same conditions, then the toss of this coin is an example of a 
random experiment in which the outcome is one of the two symbols T or H; that 
is, the sample space is the collection of these two symbols. For this example, then, 
C = {H, T}. ■


Example 1.1.2. In the cast of one red die and one white die, let the outcome be the
ordered pair (number of spots up on the red die, number of spots up on the white
die). If we assume that these two dice may be repeatedly cast under the same conditions,
then the cast of this pair of dice is a random experiment. The sample space
consists of the 36 ordered pairs: C = {(1,1),..., (1,6), (2,1),..., (2,6),..., (6,6)}. ■




We generally use small Roman letters for the elements of C such as a, b, or
c. Often for an experiment, we are interested in the chances of certain subsets of
elements of the sample space occurring. Subsets of C are often called events and are
generally denoted by capital Roman letters such as A, B, or C. If the experiment
results in an element in an event A, we say the event A has occurred. We are 
interested in the chances that an event occurs. For instance, in Example 1.1.1 we 
may be interested in the chances of getting heads; i.e., the chances of the event 
A = {H} occurring. In the second example, we may be interested in the occurrence 
of the sum of the upfaces of the dice being “7” or “11;” that is, in the occurrence of 
the event A = {(1,6), (2,5), (3, 4), (4, 3), (5, 2), (6, 1), (5, 6), (6,5)}. 
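
For small experiments such as the dice example, the sample space and an event can be enumerated directly. The following is a minimal sketch in Python; the names `sample_space` and `event_a` are illustrative and do not come from the text.

```python
from itertools import product

# Sample space of Example 1.1.2: ordered pairs (red die, white die).
sample_space = set(product(range(1, 7), repeat=2))

# Event A: the sum of the upfaces is 7 or 11.
event_a = {(r, w) for (r, w) in sample_space if r + w in (7, 11)}

print(len(sample_space))   # number of possible outcomes
print(sorted(event_a))     # the elements of A
```

Listing the event this way is a quick check that no ordered pair has been missed.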

Now conceive of our having made N repeated performances of the random experiment.
Then we can count the number f of times (the frequency) that the
event A actually occurred throughout the N performances. The ratio f/N is called
the relative frequency of the event A in these N experiments. A relative frequency
is usually quite erratic for small values of N, as you can discover by tossing
a coin. But as N increases, experience indicates that we associate with the event A 
a number, say p, that is equal or approximately equal to that number about which 
the relative frequency seems to stabilize. If we do this, then the number p can be 
interpreted as that number which, in future performances of the experiment, the 
relative frequency of the event A will either equal or approximate. Thus, although 
we cannot predict the outcome of a random experiment, we can, for a large value 
of N, predict approximately the relative frequency with which the outcome will be 
in A. The number p associated with the event A is given various names. Sometimes
it is called the probability that the outcome of the random experiment is in
A; sometimes it is called the probability of the event A; and sometimes it is called 
the probability measure of A. The context usually suggests an appropriate choice of 
terminology. 
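
The way a relative frequency settles down as N grows can be seen in a small simulation. The sketch below, in Python, assumes a fair coin (probability 0.5 of heads) and a fixed random seed; both are assumptions made here for illustration, as is the helper name `relative_frequency`.

```python
import random

def relative_frequency(num_trials, seed=0):
    """Relative frequency f/N of heads in num_trials simulated tosses of a fair coin."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(num_trials))
    return heads / num_trials

# The ratio is erratic for small N and tends to stabilize for large N.
for n in (10, 100, 10_000, 100_000):
    print(n, relative_frequency(n))
```

Changing the seed changes the individual ratios, but for large N they should remain close to the assumed value 0.5.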


Example 1.1.3. Let C denote the sample space of Example 1.1.2 and let B be 
the collection of every ordered pair of C for which the sum of the pair is equal to 
seven. Thus B = {(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)}. Suppose that the dice are
cast N = 400 times and let f denote the frequency of a sum of seven. Suppose that
400 casts result in f = 60. Then the relative frequency with which the outcome
was in B is f/N = 60/400 = 0.15. Thus we might associate with B a number p that is
close to 0.15, and p would be called the probability of the event B. ■


Remark 1.1.1. The preceding interpretation of probability is sometimes referred 
to as the relative frequency approach, and it obviously depends upon the fact that an 
experiment can be repeated under essentially identical conditions. However, many 
persons extend probability to other situations by treating it as a rational measure 
of belief. For example, the statement p = 2/5 for an event A would mean to them
that their personal or subjective probability of the event A is equal to 2/5. Hence,
if they are not opposed to gambling, this could be interpreted as a willingness on
their part to bet on the outcome of A so that the two possible payoffs are in the
ratio p/(1 - p) = (2/5)/(3/5) = 2/3. Moreover, if they truly believe that p = 2/5 is correct,
they would be willing to accept either side of the bet: (a) win 3 units if A occurs 
and lose 2 if it does not occur, or (b) win 2 units if A does not occur and lose 3 if 
it does. However, since the mathematical properties of probability given in Section
1.3 are consistent with either of these interpretations, the subsequent mathematical
development does not depend upon which approach is used. ■
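
The willingness to take either side of the bet in Remark 1.1.1 can be checked with a little arithmetic: with a subjective probability of 2/5 for A, the probability-weighted average payoff of each side is zero. The Python sketch below is only an illustration; the variable names are not from the text.

```python
from fractions import Fraction

p = Fraction(2, 5)  # subjective probability of the event A

# Side (a): win 3 units if A occurs, lose 2 units if it does not.
expected_gain_a = 3 * p - 2 * (1 - p)

# Side (b): win 2 units if A does not occur, lose 3 units if it does.
expected_gain_b = 2 * (1 - p) - 3 * p

odds_ratio = p / (1 - p)  # ratio of the two payoffs

print(expected_gain_a, expected_gain_b, odds_ratio)
```

Any other value of p would make one side of the bet favorable and the other unfavorable.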


The primary purpose of having a mathematical theory of statistics is to provide
mathematical models for random experiments. Once a model for such an experiment
has been provided and the theory worked out in detail, the statistician may,
within this framework, make inferences (that is, draw conclusions) about the random
experiment. The construction of such a model requires a theory of probability.
One of the more logically satisfying theories of probability is that based on the 
concepts of sets and functions of sets. These concepts are introduced in Section 1.2. 


1.2 Sets 


The concept of a set or a collection of objects is usually left undefined. However, 
a particular set can be described so that there is no misunderstanding as to what 
collection of objects is under consideration. For example, the set of the first 10
positive integers is sufficiently well described to make clear that the numbers 3/4 and
14 are not in the set, while the number 3 is in the set. If an object belongs to a
set, it is said to be an element of the set. For example, if C denotes the set of real
numbers x for which 0 < x < 1, then 3/4 is an element of the set C. The fact that
3/4 is an element of the set C is indicated by writing $3/4 \in C$. More generally, $c \in C$
means that c is an element of the set C.

The sets that concern us are frequently sets of numbers. However, the language 
of sets of points proves somewhat more convenient than that of sets of numbers. 
Accordingly, we briefly indicate how we use this terminology. In analytic geometry 
considerable emphasis is placed on the fact that to each point on a line (on which 
an origin and a unit point have been selected) there corresponds one and only one 
number, say x; and that to each number x there corresponds one and only one point
on the line. This one-to-one correspondence between the numbers and points on a 
line enables us to speak, without misunderstanding, of the “point x” instead of the 
“number x.” Furthermore, with a plane rectangular coordinate system and with x 
and y numbers, to each symbol (x, y) there corresponds one and only one point in the 
plane; and to each point in the plane there corresponds but one such symbol. Here 
again, we may speak of the “point (x, y),” meaning the “ordered number pair x and 
y.” This convenient language can be used when we have a rectangular coordinate 
system in a space of three or more dimensions. Thus the “point $(x_1, x_2, \ldots, x_n)$”
means the numbers $x_1, x_2, \ldots, x_n$ in the order stated. Accordingly, in describing our
sets, we frequently speak of a set of points (a set whose elements are points), being
careful, of course, to describe the set so as to avoid any ambiguity. The notation
$C = \{x : 0 < x < 1\}$ is read “C is the one-dimensional set of points x for which
0 < x < 1.” Similarly, $C = \{(x,y) : 0 \le x \le 1,\ 0 \le y \le 1\}$ can be read “C is the
two-dimensional set of points (x, y) that are interior to, or on the boundary of, a
square with opposite vertices at (0,0) and (1,1).”

We say a set C is countable if C is finite or has as many elements as there are
positive integers. For example, the sets $C_1 = \{1,2,\ldots,100\}$ and $C_2 = \{1,3,5,7,\ldots\}$
are countable sets. The interval of real numbers (0,1], though, is not countable.


1.2.1 Review of Set Theory 


As in Section 1.1, let C denote the sample space for the experiment. Recall that 
events are subsets of C. We use the words event and subset interchangeably in this 
section. An elementary algebra of sets will prove quite useful for our purposes. We 
now review this algebra below along with illustrative examples. For illustration, we 
also make use of Venn diagrams. Consider the collection of Venn diagrams in 
Figure 1.2.1. The interior of the rectangle in each plot represents the sample space 
C. The shaded region in Panel (a) represents the event A. 


Figure 1.2.1: A series of Venn diagrams. The sample space C is represented by
the interior of the rectangle in each plot. Panel (a) depicts the event A; Panel (b)
depicts $A \subset B$; Panel (c) depicts $A \cup B$; and Panel (d) depicts $A \cap B$.


We first define the complement of an event A. 


Definition 1.2.1. The complement of an event A is the set of all elements in C
which are not in A. We denote the complement of A by $A^c$. That is, $A^c = \{x \in C : x \notin A\}$.


The complement of A is represented by the white space in the Venn diagram in
Panel (a) of Figure 1.2.1.

The empty set is the event with no elements in it. It is denoted by $\emptyset$. Note
that $C^c = \emptyset$ and $\emptyset^c = C$. The next definition defines when one event is a subset of
another.


Definition 1.2.2. If each element of a set A is also an element of set B, the set A
is called a subset of the set B. This is indicated by writing $A \subset B$. If $A \subset B$ and
also $B \subset A$, the two sets have the same elements, and this is indicated by writing
A = B.


Panel (b) of Figure 1.2.1 depicts $A \subset B$.
The event A or B is defined as follows:


Definition 1.2.3. Let A and B be events. Then the union of A and B is the set
of all elements that are in A or in B or in both A and B. The union of A and B
is denoted by $A \cup B$.


Panel (c) of Figure 1.2.1 shows $A \cup B$.
The event that both A and B occur is defined as follows:


Definition 1.2.4. Let A and B be events. Then the intersection of A and B is
the set of all elements that are in both A and B. The intersection of A and B is
denoted by $A \cap B$.


Panel (d) of Figure 1.2.1 illustrates $A \cap B$.
Two events are disjoint if they have no elements in common. More formally, we
define the following:


Definition 1.2.5. Let A and B be events. Then A and B are disjoint if $A \cap B = \emptyset$.


If A and B are disjoint, then we say $A \cup B$ forms a disjoint union. The next two
examples illustrate these concepts.
examples illustrate these concepts. 


Example 1.2.1. Suppose we have a spinner with the numbers 1 through 10 on
it. The experiment is to spin the spinner and record the number spun. Then the sample
space is $\mathcal{C} = \{1,2,\ldots,10\}$. Define the events A, B, and C by A = {1,2}, B = {2,3,4}, and
C = {3,4,5,6}, respectively. Then

$$A^c = \{3,4,\ldots,10\}; \quad A \cup B = \{1,2,3,4\}; \quad A \cap B = \{2\}$$
$$A \cap C = \emptyset; \quad B \cap C = \{3,4\}; \quad B \cap C \subset B; \quad B \cap C \subset C$$
$$A \cup (B \cap C) = \{1,2\} \cup \{3,4\} = \{1,2,3,4\} \qquad (1.2.1)$$
$$(A \cup B) \cap (A \cup C) = \{1,2,3,4\} \cap \{1,2,3,4,5,6\} = \{1,2,3,4\} \qquad (1.2.2)$$

The reader should verify these results. ■
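
The results of Example 1.2.1 can also be checked mechanically. The following is a minimal sketch using Python's built-in set type, where `space` stands for the sample space {1, 2, ..., 10}; the variable names are chosen here for illustration.

```python
space = set(range(1, 11))            # sample space {1, 2, ..., 10}
A, B, C = {1, 2}, {2, 3, 4}, {3, 4, 5, 6}

complement_A = space - A             # elements of the space that are not in A
print(complement_A)
print(A | B, A & B)                  # union and intersection of A and B
print(A & C == set())                # True when A and C are disjoint
print(B & C <= B, B & C <= C)        # subset checks
print(A | (B & C), (A | B) & (A | C))  # the two sides of (1.2.1) and (1.2.2)
```

In Python, `|` is union, `&` is intersection, `-` is set difference, and `<=` tests for a subset.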


Example 1.2.2. For this example, suppose the experiment is to select a real number
in the open interval (0,5); hence, the sample space is $\mathcal{C} = (0,5)$. Let A = (1,3),
B = (2,4), and C = [3, 4.5). Then

$$A \cup B = (1,4); \quad A \cap B = (2,3); \quad B \cap C = [3,4)$$
$$A \cap (B \cup C) = (1,3) \cap (2,4.5) = (2,3) \qquad (1.2.3)$$
$$(A \cap B) \cup (A \cap C) = (2,3) \cup \emptyset = (2,3) \qquad (1.2.4)$$

A sketch of the real number line between 0 and 5 helps to verify these results. ■


Expressions (1.2.1)-(1.2.2) and (1.2.3)-(1.2.4) are illustrations of general distributive
laws. For any sets A, B, and C,

$$A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$$
$$A \cup (B \cap C) = (A \cup B) \cap (A \cup C). \qquad (1.2.5)$$

These follow directly from set theory. To verify each identity, sketch Venn diagrams
of both sides.
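
Besides Venn diagrams, the distributive laws can be spot-checked on randomly generated finite sets. The Python sketch below does this for subsets of a hypothetical 20-element space with a fixed seed; the helper names are illustrative. Passing such a check illustrates the laws but does not prove them.

```python
import random

def distributive_laws_hold(A, B, C):
    """Return True if both distributive laws hold for the sets A, B, and C."""
    first = A & (B | C) == (A & B) | (A & C)
    second = A | (B & C) == (A | B) & (A | C)
    return first and second

rng = random.Random(1)
space = range(20)

def random_subset():
    """Include each element of the space independently with probability 1/2."""
    return {x for x in space if rng.random() < 0.5}

all_hold = all(
    distributive_laws_hold(random_subset(), random_subset(), random_subset())
    for _ in range(1000)
)
print(all_hold)
```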


The next two identities are collectively known as DeMorgan’s Laws. For any
sets A and B,

$$(A \cap B)^c = A^c \cup B^c \qquad (1.2.6)$$
$$(A \cup B)^c = A^c \cap B^c. \qquad (1.2.7)$$


For instance, in Example 1.2.1,

$$(A \cup B)^c = \{1,2,3,4\}^c = \{5,6,\ldots,10\} = \{3,4,\ldots,10\} \cap \{1,5,6,\ldots,10\} = A^c \cap B^c;$$

while, from Example 1.2.2,

$$(A \cap B)^c = (2,3)^c = (0,2] \cup [3,5) = [(0,1] \cup [3,5)] \cup [(0,2] \cup [4,5)] = A^c \cup B^c.$$


As the last expression suggests, it is easy to extend unions and intersections to more
than two sets. If $A_1, A_2, \ldots, A_n$ are any sets, we define

$$A_1 \cup A_2 \cup \cdots \cup A_n = \{x : x \in A_i, \text{ for some } i = 1,2,\ldots,n\} \qquad (1.2.8)$$
$$A_1 \cap A_2 \cap \cdots \cap A_n = \{x : x \in A_i, \text{ for all } i = 1,2,\ldots,n\}. \qquad (1.2.9)$$


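We often abbreviate these by $\cup_{i=1}^{n} A_i$ and $\cap_{i=1}^{n} A_i$, respectively. Expressions for
countable unions and intersections follow directly; that is, if $A_1, A_2, \ldots, A_n, \ldots$ is a
sequence of sets then

$$A_1 \cup A_2 \cup \cdots = \{x : x \in A_n, \text{ for some } n = 1,2,\ldots\} = \cup_{n=1}^{\infty} A_n \qquad (1.2.10)$$
$$A_1 \cap A_2 \cap \cdots = \{x : x \in A_n, \text{ for all } n = 1,2,\ldots\} = \cap_{n=1}^{\infty} A_n. \qquad (1.2.11)$$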


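The next two examples illustrate these ideas.

Example 1.2.3. Suppose $\mathcal{C} = \{1,2,3,\ldots\}$. If $A_n = \{1,3,\ldots,2n-1\}$ and $B_n = \{n, n+1, \ldots\}$, for $n = 1,2,3,\ldots$, then

$$\cup_{n=1}^{\infty} A_n = \{1,3,5,\ldots\}; \quad \cap_{n=1}^{\infty} A_n = \{1\}; \qquad (1.2.12)$$
$$\cup_{n=1}^{\infty} B_n = \mathcal{C}; \quad \cap_{n=1}^{\infty} B_n = \emptyset. \qquad (1.2.13)$$ ■
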
Example 1.2.4. Suppose $\mathcal{C}$ is the interval of real numbers (0,5). Suppose
$C_n = (1 - n^{-1}, 2 + n^{-1})$ and $D_n = (n^{-1}, 3 - n^{-1})$, for $n = 1,2,3,\ldots$. Then

$$\cup_{n=1}^{\infty} C_n = (0,3); \quad \cap_{n=1}^{\infty} C_n = [1,2] \qquad (1.2.14)$$
$$\cup_{n=1}^{\infty} D_n = (0,3); \quad \cap_{n=1}^{\infty} D_n = (1,2). \qquad (1.2.15)$$ ■


We occasionally have sequences of sets that are monotone. They are of two
types. We say a sequence of sets $\{A_n\}$ is nondecreasing (nested upward) if
$A_n \subset A_{n+1}$ for $n = 1,2,3,\ldots$. For such a sequence, we define

$$\lim_{n \to \infty} A_n = \cup_{n=1}^{\infty} A_n. \qquad (1.2.16)$$

The sequence of sets $A_n = \{1,3,\ldots,2n-1\}$ of Example 1.2.3 is such a sequence.
So in this case, we write $\lim_{n \to \infty} A_n = \{1,3,5,\ldots\}$. The sequence of sets $\{D_n\}$ of
Example 1.2.4 is also a nondecreasing sequence of sets.

The second type of monotone sets consists of the nonincreasing (nested
downward) sequences. A sequence of sets $\{A_n\}$ is nonincreasing if $A_n \supset A_{n+1}$
for $n = 1,2,3,\ldots$. In this case, we define

$$\lim_{n \to \infty} A_n = \cap_{n=1}^{\infty} A_n. \qquad (1.2.17)$$

The sequences of sets $\{B_n\}$ and $\{C_n\}$ of Examples 1.2.3 and 1.2.4, respectively, are
examples of nonincreasing sequences of sets.
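
The monotone behavior of the sequences in Example 1.2.3 can be explored numerically for the first few sets. The Python sketch below is an illustration only: because a computer cannot hold the infinite set {n, n+1, ...}, each B_n is truncated at an assumed cutoff `CAP`, and only the first `N` sets are examined.

```python
def A(n):
    """A_n = {1, 3, ..., 2n - 1} from Example 1.2.3."""
    return set(range(1, 2 * n, 2))

CAP = 200   # assumed cutoff so that B_n is a finite set

def B(n):
    """B_n = {n, n + 1, ...}, truncated at CAP."""
    return set(range(n, CAP + 1))

N = 50      # number of sets examined (an arbitrary choice)

nondecreasing = all(A(n) <= A(n + 1) for n in range(1, N))   # A_n is a subset of A_{n+1}
nonincreasing = all(B(n) >= B(n + 1) for n in range(1, N))   # B_n contains B_{n+1}

union_A = set().union(*(A(n) for n in range(1, N + 1)))
inter_B = set.intersection(*(B(n) for n in range(1, N + 1)))

print(nondecreasing, nonincreasing)
print(union_A == A(N))    # for a nondecreasing sequence, the union of the first N sets is A_N
print(min(inter_B))       # smallest element that survives the first N intersections
```

The smallest surviving element of the intersection grows with N, which is consistent with the intersection of all the B_n being empty.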


1.2.2 Set Functions 


Many of the functions used in calculus and in this book are functions that map real 
numbers into real numbers. We are concerned also with functions that map sets 
into real numbers. Such functions are naturally called functions of a set or, more 
simply, set functions. Next we give some examples of set functions and evaluate 
them for certain simple sets. 


Example 1.2.5. Let C = R, the set of real numbers. For a subset A in C, let Q(A)
be equal to the number of points in A that correspond to positive integers. Then
Q(A) is a set function of the set A. Thus, if $A = \{x : 0 < x < 5\}$, then Q(A) = 4;
if $A = \{-2, -1\}$, then Q(A) = 0; and if $A = \{x : -\infty < x < 6\}$, then Q(A) = 5.


Example 1.2.6. Let $C = R^2$. For a subset A of C, let Q(A) be the area of A
if A has a finite area; otherwise, let Q(A) be undefined. Thus, if
$A = \{(x,y) : x^2 + y^2 < 1\}$, then $Q(A) = \pi$; if A = {(0,0), (1,1), (0,1)}, then Q(A) = 0; and if
$A = \{(x,y) : 0 < x,\ 0 < y,\ x + y < 1\}$, then Q(A) = 1/2.


Often our set functions are defined in terms of sums or integrals.¹ With this in
mind, we introduce the following notation. The symbol

$$\int_A f(x)\,dx$$

means the ordinary (Riemann) integral of f(x) over a prescribed one-dimensional
set A and the symbol

$$\iint_A g(x,y)\,dx\,dy$$

means the Riemann integral of g(x,y) over a prescribed two-dimensional set A.
This notation can be extended to integrals over n dimensions. To be sure, unless
these sets A and these functions f(x) and g(x,y) are chosen with care, the integrals
frequently fail to exist. Similarly, the symbol

$$\sum_A f(x)$$

means the sum extended over all $x \in A$ and the symbol

$$\sum\sum_A g(x,y)$$

means the sum extended over all $(x,y) \in A$. As with integration, this notation
extends to sums over n dimensions.

¹ Please see Chapters 2 and 3 of Mathematical Comments, at site noted in the Preface, for a
review of sums and integrals.

The first example is for a set function defined on sums involving a geometric
series. As pointed out in Example 2.3.1 of Mathematical Comments,² if |a| < 1,
then the following series converges to 1/(1 - a):

$$\sum_{n=0}^{\infty} a^n = \frac{1}{1-a}, \quad \text{if } |a| < 1. \qquad (1.2.18)$$


Example 1.2.7. Let C be the set of all nonnegative integers and let A be a subset
of C. Define the set function Q by

$$Q(A) = \sum_{n \in A} \left(\frac{2}{3}\right)^n. \qquad (1.2.19)$$

It follows from (1.2.18) that Q(C) = 3. If A = {1,2,3} then Q(A) = 38/27. Suppose
B = {1,3,5,...} is the set of all odd positive integers. The computation of Q(B) is
given next. This derivation consists of rewriting the series so that (1.2.18) can be
applied. Frequently, we perform such derivations in this book.

$$Q(B) = \sum_{n \in B} \left(\frac{2}{3}\right)^n = \sum_{n=0}^{\infty} \left(\frac{2}{3}\right)^{2n+1}$$
$$= \frac{2}{3} \sum_{n=0}^{\infty} \left[\left(\frac{2}{3}\right)^2\right]^n = \frac{2}{3} \cdot \frac{1}{1 - (4/9)} = \frac{6}{5}.$$ ■
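
The values in Example 1.2.7 can be reproduced numerically. The sketch below, in Python, evaluates the set function exactly for a finite set using rational arithmetic and approximates the two infinite series by long partial sums; the number of terms kept is an arbitrary choice made here.

```python
from fractions import Fraction

r = Fraction(2, 3)

def Q(A):
    """Q(A) = sum of (2/3)^n over n in A, for a finite collection A of nonnegative integers."""
    return sum(r ** n for n in A)

exact_Q_A = Q({1, 2, 3})                     # exact value for A = {1, 2, 3}

# Partial sums approximate the infinite series.
approx_Q_C = float(Q(range(200)))            # first 200 nonnegative integers
approx_Q_B = float(Q(range(1, 400, 2)))      # first 200 odd positive integers

print(exact_Q_A, approx_Q_C, approx_Q_B)
```

The partial sums agree with the closed-form values 3 and 6/5 to within floating-point precision because the omitted tail of a geometric series with ratio 2/3 is negligible after a few hundred terms.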


In the next example, the set function is defined in terms of an integral involving
the exponential function $f(x) = e^{-x}$.


² Downloadable at site noted in the Preface.
Example 1.2.8. Let C be the interval of positive real numbers, i.e., $C = (0,\infty)$.
Let A be a subset of C. Define the set function Q by

$$Q(A) = \int_A e^{-x}\,dx, \qquad (1.2.20)$$

provided the integral exists. The reader should work through the following integrations:

$$Q[(1,3)] = \int_1^3 e^{-x}\,dx = -e^{-x}\Big|_1^3 = e^{-1} - e^{-3} \approx 0.318$$
$$Q[(5,\infty)] = \int_5^{\infty} e^{-x}\,dx = -e^{-x}\Big|_5^{\infty} = e^{-5} \approx 0.007$$
$$Q[(1,3) \cup [3,5)] = \int_1^5 e^{-x}\,dx = \int_1^3 e^{-x}\,dx + \int_3^5 e^{-x}\,dx = Q[(1,3)] + Q([3,5)]$$
$$Q(C) = \int_0^{\infty} e^{-x}\,dx = 1.$$ ■
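
The integrations in Example 1.2.8 are easy to confirm numerically. The Python sketch below uses the antiderivative of the integrand directly, so `Q_interval` (a name introduced here for illustration) only handles intervals; it is not a general integrator.

```python
import math

def Q_interval(a, b):
    """Integral of exp(-x) over the interval from a to b, via the antiderivative -exp(-x)."""
    return math.exp(-a) - math.exp(-b)

q_13 = Q_interval(1, 3)
q_5_inf = Q_interval(5, math.inf)     # exp(-inf) evaluates to 0.0
q_35 = Q_interval(3, 5)
q_15 = Q_interval(1, 5)
q_all = Q_interval(0, math.inf)

print(round(q_13, 3), round(q_5_inf, 3))
print(abs(q_15 - (q_13 + q_35)) < 1e-12)   # additivity over the disjoint pieces (1,3) and [3,5)
print(q_all)
```

Whether an endpoint is included does not affect the value, since a single point contributes nothing to the integral.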


Our final example involves an n-dimensional integral.


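Example 1.2.9. Let $C = R^n$. For A in C define the set function

$$Q(A) = \int \cdots \int_A dx_1\,dx_2 \cdots dx_n,$$

provided the integral exists. For example, if
$A = \{(x_1, x_2, \ldots, x_n) : 0 \le x_1 \le x_2 \le 1,\ 0 \le x_i \le 1, \text{ for } i = 3,4,\ldots,n\}$,
then upon expressing the multiple integral as an iterated integral³ we obtain

$$Q(A) = \int_0^1 \left[\int_0^{x_2} dx_1\right] dx_2 \cdot \prod_{i=3}^{n} \left[\int_0^1 dx_i\right] = \frac{x_2^2}{2}\Big|_0^1 \cdot 1 = \frac{1}{2}.$$

If $B = \{(x_1, x_2, \ldots, x_n) : 0 \le x_1 \le x_2 \le \cdots \le x_n \le 1\}$, then

$$Q(B) = \int_0^1 \left[\int_0^{x_n} \cdots \left[\int_0^{x_3} \left[\int_0^{x_2} dx_1\right] dx_2\right] \cdots dx_{n-1}\right] dx_n = \frac{1}{n!},$$

where $n! = n(n-1) \cdots 3 \cdot 2 \cdot 1$. ■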
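
The value 1/n! for the ordered region B can be checked approximately by simulation: the volume of B equals the fraction of points, drawn uniformly from the unit cube, whose coordinates happen to be in increasing order. The Python sketch below is a Monte Carlo illustration under that reading; the sample size, the seed, and the function name `estimate_Q_B` are choices made here, not part of the text.

```python
import math
import random

def estimate_Q_B(n, num_points=100_000, seed=0):
    """Monte Carlo estimate of the volume of {0 <= x1 <= x2 <= ... <= xn <= 1}."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_points):
        point = [rng.random() for _ in range(n)]
        # Count the point if its coordinates are in nondecreasing order.
        if all(point[i] <= point[i + 1] for i in range(n - 1)):
            hits += 1
    return hits / num_points

for n in (2, 3, 4):
    print(n, estimate_Q_B(n), 1 / math.factorial(n))
```

The estimates are random and only approximate 1/n!; larger sample sizes give closer agreement.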


³ For a discussion of multiple integrals in terms of iterated integrals, see Chapter 3 of
Mathematical Comments.
EXERCISES


1.2.1. Find the union $C_1 \cup C_2$ and the intersection $C_1 \cap C_2$ of the two sets $C_1$ and
$C_2$, where

(a) $C_1 = \{0,1,2\}$, $C_2 = \{2,3,4\}$.

(b) $C_1 = \{x : 0 < x < 2\}$, $C_2 = \{x : 1 < x < 3\}$.

(c) $C_1 = \{(x,y) : 0 < x < 2,\ 1 < y < 2\}$, $C_2 = \{(x,y) : 1 < x < 3,\ 1 < y < 3\}$.
1.2.2. Find the complement $C^c$ of the set C with respect to the space $\mathcal{C}$ if

(a) $\mathcal{C} = \{x : 0 < x < 1\}$, $C = \{x : 5/8 < x < 1\}$.

(b) $\mathcal{C} = \{(x,y,z) : x^2 + y^2 + z^2 \le 1\}$, $C = \{(x,y,z) : x^2 + y^2 + z^2 = 1\}$.

(c) $\mathcal{C} = \{(x,y) : |x| + |y| < 2\}$, $C = \{(x,y) : x^2 + y^2 < 2\}$.


1.2.3. List all possible arrangements of the four letters m, a, r, and y. Let $C_1$ be
the collection of the arrangements in which y is in the last position. Let $C_2$ be the
collection of the arrangements in which m is in the first position. Find the union
and the intersection of $C_1$ and $C_2$.


1.2.4. Concerning DeMorgan’s Laws (1.2.6) and (1.2.7): 

(a) Use Venn diagrams to verify the laws. 

(b) Show that the laws are true. 

(c) Generalize the laws to countable unions and intersections. 


1.2.5. By the use of Venn diagrams, in which the space $\mathcal{C}$ is the set of points
enclosed by a rectangle containing the circles $C_1$, $C_2$, and $C_3$, compare the following
sets. These laws are called the distributive laws.


(a) $C_1 \cap (C_2 \cup C_3)$ and $(C_1 \cap C_2) \cup (C_1 \cap C_3)$.
(b) $C_1 \cup (C_2 \cap C_3)$ and $(C_1 \cup C_2) \cap (C_1 \cup C_3)$.


1.2.6. Show that the following sequences of sets, $\{C_k\}$, are nondecreasing, (1.2.16),
then find $\lim_{k \to \infty} C_k$.


(a) $C_k = \{x : 1/k < x < 3 - 1/k\}$, $k = 1,2,3,\ldots$.
(b) $C_k = \{(x,y) : 1/k < x^2 + y^2 < 4 - 1/k\}$, $k = 1,2,3,\ldots$.


1.2.7. Show that the following sequences of sets, $\{C_k\}$, are nonincreasing, (1.2.17),
then find $\lim_{k \to \infty} C_k$.


(a) $C_k = \{x : 2 - 1/k < x < 2\}$, $k = 1,2,3,\ldots$.
(b) $C_k = \{x : 2 < x < 2 + 1/k\}$, $k = 1,2,3,\ldots$.
(c) $C_k = \{(x,y) : 0 < x^2 + y^2 < 1/k\}$, $k = 1,2,3,\ldots$.


1.2.8. For every one-dimensional set C, define the function $Q(C) = \sum_C f(x)$,
where $f(x) = (2/3)(1/3)^x$, $x = 0,1,2,\ldots$, zero elsewhere. If $C_1 = \{x : x = 0,1,2,3\}$
and $C_2 = \{x : x = 0,1,2,\ldots\}$, find $Q(C_1)$ and $Q(C_2)$.

Hint: Recall that $S_n = a + ar + \cdots + ar^{n-1} = a(1 - r^n)/(1 - r)$ and, hence, it
follows that $\lim_{n \to \infty} S_n = a/(1 - r)$ provided that $|r| < 1$.


1.2.9. For every one-dimensional set C for which the integral exists, let
$Q(C) = \int_C f(x)\,dx$, where $f(x) = 6x(1 - x)$, $0 < x < 1$, zero elsewhere; otherwise, let Q(C)
be undefined. If $C_1 = \{x : 1/4 < x < 3/4\}$, $C_2 = \{1/2\}$, and $C_3 = \{x : 0 < x < 10\}$, find
$Q(C_1)$, $Q(C_2)$, and $Q(C_3)$.


1.2.10. For every two-dimensional set C contained in $R^2$ for which the integral
exists, let $Q(C) = \iint_C (x^2 + y^2)\,dx\,dy$. If $C_1 = \{(x,y) : -1 < x < 1,\ -1 < y < 1\}$,
$C_2 = \{(x,y) : -1 < x = y < 1\}$, and $C_3 = \{(x,y) : x^2 + y^2 < 1\}$, find $Q(C_1)$, $Q(C_2)$,
and $Q(C_3)$.


1.2.11. Let $\mathcal{C}$ denote the set of points that are interior to, or on the boundary of, a
square with opposite vertices at the points (0,0) and (1,1). Let $Q(C) = \iint_C dy\,dx$.


(a) If $C \subset \mathcal{C}$ is the set $\{(x,y) : 0 < x < y < 1\}$, compute Q(C).
(b) If $C \subset \mathcal{C}$ is the set $\{(x,y) : 0 < x = y < 1\}$, compute Q(C).
(c) If $C \subset \mathcal{C}$ is the set $\{(x,y) : 0 < x/2 < y < 3x/2 < 1\}$, compute Q(C).


1.2.12. Let $\mathcal{C}$ be the set of points interior to or on the boundary of a cube with
edge of length 1. Moreover, say that the cube is in the first octant with one vertex
at the point (0,0,0) and an opposite vertex at the point (1,1,1). Let
$Q(C) = \iiint_C dx\,dy\,dz$.


(a) If $C \subset \mathcal{C}$ is the set $\{(x,y,z) : 0 < x < y < z < 1\}$, compute Q(C).


(b) If C is the subset $\{(x,y,z) : 0 < x = y = z < 1\}$, compute Q(C).


1.2.13. Let C denote the set $\{(x,y,z) : x^2 + y^2 + z^2 < 1\}$. Using spherical coordinates,
evaluate

$$Q(C) = \iiint_C \sqrt{x^2 + y^2 + z^2}\,dx\,dy\,dz.$$


1.2.14. To join a certain club, a person must be either a statistician or a mathematician
or both. Of the 25 members in this club, 19 are statisticians and 16
are mathematicians. How many persons in the club are both a statistician and a 
mathematician? 


1.2.15. After a hard-fought football game, it was reported that, of the 11 starting 
players, 8 hurt a hip, 6 hurt an arm, 5 hurt a knee, 3 hurt both a hip and an arm, 
2 hurt both a hip and a knee, 1 hurt both an arm and a knee, and no one hurt all 
three. Comment on the accuracy of the report. 
1.3 The Probability Set Function


Given an experiment, let C denote the sample space of all possible outcomes. As 
discussed in Section 1.1, we are interested in assigning probabilities to events, i.e., 
subsets of C. What should be our collection of events? If C is a finite set, then we 
could take the set of all subsets as this collection. For infinite sample spaces, though, 
with assignment of probabilities in mind, this poses mathematical technicalities that 
are better left to a course in probability theory. We assume that in all cases, the 
collection of events is sufficiently rich to include all possible events of interest and is 
closed under complements and countable unions of these events. Using DeMorgan’s 
Laws, (1.2.6)-(1.2.7), the collection is then also closed under countable intersections.
We denote this collection of events by $\mathcal{B}$. Technically, such a collection of events is
called a σ-field of subsets.

Now that we have a sample space, C, and our collection of events, $\mathcal{B}$, we can define
the third component in our probability space, namely a probability set function. In 
order to motivate its definition, we consider the relative frequency approach to 
probability. 


Remark 1.3.1. The definition of probability consists of three axioms which we
motivate by the following three intuitive properties of relative frequency. Let C be
a sample space and let $A \subset C$. Suppose we repeat the experiment N times. Then
the relative frequency of A is $f_A = \#\{A\}/N$, where $\#\{A\}$ denotes the number of
times A occurred in the N repetitions. Note that $f_A \ge 0$ and $f_C = 1$. These are
the first two properties. For the third, suppose that $A_1$ and $A_2$ are disjoint events.
Then $f_{A_1 \cup A_2} = f_{A_1} + f_{A_2}$. These three properties of relative frequencies form the
axioms of a probability, except that the third axiom is in terms of countable unions.
As with the axioms of probability, the readers should check that the theorems we
prove below about probabilities agree with their intuition of relative frequency.
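
The three properties of relative frequency in Remark 1.3.1 can be observed in simulated data. The Python sketch below casts two dice N times (fair dice and a fixed seed are assumptions made here) and uses the disjoint events "sum is 7" and "sum is 11"; all names are illustrative.

```python
import random

rng = random.Random(2)
N = 1000
rolls = [(rng.randint(1, 6), rng.randint(1, 6)) for _ in range(N)]

def rel_freq(event):
    """Relative frequency of an event, given as a true/false function of the outcome."""
    return sum(1 for outcome in rolls if event(outcome)) / N

def sum_is_7(o):
    return o[0] + o[1] == 7

def sum_is_11(o):
    return o[0] + o[1] == 11

def either(o):
    return sum_is_7(o) or sum_is_11(o)      # union of the two disjoint events

def whole_space(o):
    return True                              # every outcome lies in the sample space

print(rel_freq(sum_is_7) >= 0)               # first property
print(rel_freq(whole_space) == 1)            # second property
print(abs(rel_freq(either) - (rel_freq(sum_is_7) + rel_freq(sum_is_11))) < 1e-12)  # third
```

The third comparison uses a small tolerance only because of floating-point rounding; the underlying counts add exactly.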


Definition 1.3.1 (Probability). Let C be a sample space and let $\mathcal{B}$ be the set of
events. Let P be a real-valued function defined on $\mathcal{B}$. Then P is a probability set
function if P satisfies the following three conditions:


1. $P(A) \ge 0$, for all $A \in \mathcal{B}$.
2. $P(C) = 1$.


3. If $\{A_n\}$ is a sequence of events in $\mathcal{B}$ and $A_m \cap A_n = \emptyset$ for all $m \ne n$, then
$$P\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} P(A_n).$$


A collection of events whose members are pairwise disjoint, as in (3), is said to 
be a mutually exclusive collection and its union is often referred to as a disjoint 
union. The collection is further said to be exhaustive if the union of its events is
the sample space, in which case $\sum_{n=1}^{\infty} P(A_n) = 1$. We often say that a mutually
exclusive and exhaustive collection of events forms a partition of C.
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A probability set function tells us how the probability is distributed over the set 
of events, B. In this sense we speak of a distribution of probability. We often drop 
the word “set” and refer to P as a probability function. 

The following theorems give us some other properties of a probability set func- 
tion. In the statement of each of these theorems, P(A) is taken, tacitly, to be a 
probability set function defined on the collection of events $\mathcal{B}$ of a sample space $\mathcal{C}$.


Theorem 1.3.1. For each event $A \in \mathcal{B}$, $P(A) = 1 - P(A^c)$.


Proof: We have $\mathcal{C} = A \cup A^c$ and $A \cap A^c = \phi$. Thus, from (2) and (3) of Definition
1.3.1, it follows that
$$1 = P(A) + P(A^c),$$


which is the desired result. 

Theorem 1.3.2. The probability of the null set is zero; that is, $P(\phi) = 0$.

Proof: In Theorem 1.3.1, take $A = \phi$ so that $A^c = \mathcal{C}$. Accordingly, we have
$$P(\phi) = 1 - P(\mathcal{C}) = 1 - 1 = 0$$


and the theorem is proved. 


Theorem 1.3.3. If $A$ and $B$ are events such that $A \subset B$, then $P(A) \le P(B)$.


Proof: Now $B = A \cup (A^c \cap B)$ and $A \cap (A^c \cap B) = \phi$. Hence, from (3) of Definition
1.3.1,
$$P(B) = P(A) + P(A^c \cap B).$$

From (1) of Definition 1.3.1, $P(A^c \cap B) \ge 0$. Hence, $P(B) \ge P(A)$. ■


Theorem 1.3.4. For each $A \in \mathcal{B}$, $0 \le P(A) \le 1$.

Proof: Since $\phi \subset A \subset \mathcal{C}$, we have by Theorem 1.3.3 that
$$P(\phi) \le P(A) \le P(\mathcal{C}) \quad \text{or} \quad 0 \le P(A) \le 1,$$
the desired result. ■


Part (3) of the definition of probability says that $P(A \cup B) = P(A) + P(B)$ if $A$
and $B$ are disjoint, i.e., $A \cap B = \phi$. The next theorem gives the rule for any two
events, regardless of whether they are disjoint.


Theorem 1.3.5. If $A$ and $B$ are events in $\mathcal{C}$, then
$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$


Proof: Each of the sets $A \cup B$ and $B$ can be represented, respectively, as a union of
nonintersecting sets as follows:

$$A \cup B = A \cup (A^c \cap B) \quad \text{and} \quad B = (A \cap B) \cup (A^c \cap B). \quad (1.3.1)$$

That these identities hold for all sets $A$ and $B$ follows from set theory. Also, the
Venn diagrams of Figure 1.3.1 offer a verification of them.
Thus, from (3) of Definition 1.3.1,

$$P(A \cup B) = P(A) + P(A^c \cap B)$$

and

$$P(B) = P(A \cap B) + P(A^c \cap B).$$

If the second of these equations is solved for $P(A^c \cap B)$ and this result is substituted
in the first equation, we obtain

$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$


This completes the proof. 


[Figure: two Venn diagrams of the events $A$ and $B$. Panel (a) is labeled
$A \cup B = A \cup (A^c \cap B)$; Panel (b) is labeled $A = (A \cap B^c) \cup (A \cap B)$.]

Figure 1.3.1: Venn diagrams depicting the two disjoint unions given in expression
(1.3.1). Panel (a) depicts the first disjoint union while Panel (b) shows the second
disjoint union.
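
Theorem 1.3.5 and the first disjoint union in (1.3.1) can be checked by direct counting on a small sample space. The following is a minimal sketch, assuming one roll of a fair die; the events `A` and `B` and the helper `P` are chosen here only for illustration.

```r
# One roll of a fair die; A and B are events chosen only for illustration.
omega <- 1:6
P <- function(E) length(E) / length(omega)    # equally likely outcomes

A <- c(1, 2, 3)
B <- c(3, 4)

lhs <- P(union(A, B))                         # direct count of A U B = {1,2,3,4}
rhs <- P(A) + P(B) - P(intersect(A, B))       # right side of Theorem 1.3.5

# Disjoint pieces from (1.3.1): A U B = A U (A^c n B)
A_comp <- setdiff(omega, A)
pieces <- P(A) + P(intersect(A_comp, B))
```

All three quantities equal 4/6, since $A \cap B = \{3\}$ is counted once in the union but twice in $P(A) + P(B)$.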


Example 1.3.1. Let $\mathcal{C}$ denote the sample space of Example 1.1.2. Let the probability
set function assign a probability of $\frac{1}{36}$ to each of the 36 points in $\mathcal{C}$; that is, the
dice are fair. If $C_1 = \{(1,1), (2,1), (3,1), (4,1), (5,1)\}$ and $C_2 = \{(1,2), (2,2), (3,2)\}$,
then $P(C_1) = \frac{5}{36}$, $P(C_2) = \frac{3}{36}$, $P(C_1 \cup C_2) = \frac{8}{36}$, and $P(C_1 \cap C_2) = 0$. ■


Example 1.3.2. Two coins are to be tossed and the outcome is the ordered pair 
(face on the first coin, face on the second coin). Thus the sample space may be 
represented as C = {(H, H), (H,T),(T, H),(T,T)}. Let the probability set function 
assign a probability of $\frac{1}{4}$ to each element of $\mathcal{C}$. Let $C_1 = \{(H,H),(H,T)\}$ and
$C_2 = \{(H,H),(T,H)\}$. Then $P(C_1) = P(C_2) = \frac{1}{2}$, $P(C_1 \cap C_2) = \frac{1}{4}$, and, in
accordance with Theorem 1.3.5, $P(C_1 \cup C_2) = \frac{1}{2} + \frac{1}{2} - \frac{1}{4} = \frac{3}{4}$. ■

For a finite sample space, we can generate probabilities as follows. Let
$\mathcal{C} = \{x_1, x_2, \ldots, x_m\}$ be a finite set of $m$ elements. Let $p_1, p_2, \ldots, p_m$ be fractions such
that

$$0 \le p_i \le 1 \text{ for } i = 1, 2, \ldots, m \quad \text{and} \quad \sum_{i=1}^{m} p_i = 1. \quad (1.3.2)$$


Suppose we define P by 


$$P(A) = \sum_{x_i \in A} p_i, \quad \text{for all subsets } A \text{ of } \mathcal{C}. \quad (1.3.3)$$

Then $P(A) \ge 0$ and $P(\mathcal{C}) = 1$. Further, it follows that $P(A \cup B) = P(A) + P(B)$
when $A \cap B = \phi$. Therefore, $P$ is a probability on $\mathcal{C}$. For illustration, each of the
following four assignments forms a probability on C = {1,2,...,6}. For each, we 
also compute P(A) for the event A = {1,6}. 


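$$p_1 = p_2 = \cdots = p_6 = \frac{1}{6}; \quad P(A) = \frac{1}{3}. \quad (1.3.4)$$
$$p_1 = p_2 = 0.1, \; p_3 = p_4 = p_5 = p_6 = 0.2; \quad P(A) = 0.3.$$
$$p_i = \frac{i}{21}, \; i = 1, 2, \ldots, 6; \quad P(A) = \frac{7}{21}.$$
$$p_1 = \frac{3}{\pi}, \; p_2 = 1 - \frac{3}{\pi}, \; p_3 = p_4 = p_5 = p_6 = 0.0; \quad P(A) = \frac{3}{\pi}.$$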


Note that the individual probabilities for the first probability set function, 
(1.3.4), are the same. This is an example of the equilikely case which we now 
formally define. 
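
The construction in (1.3.2)-(1.3.3) translates directly into a few lines of R. The following is a minimal sketch for the four assignments on {1, 2, ..., 6}, taking the third as $p_i = i/21$ and the fourth as $p_1 = 3/\pi$, $p_2 = 1 - 3/\pi$; the list labels and the helper name `prob` are hypothetical, introduced here only for illustration.

```r
# Four probability assignments on C = {1,...,6}; the labels are illustrative.
assignments <- list(
  equilikely = rep(1/6, 6),
  second     = c(0.1, 0.1, 0.2, 0.2, 0.2, 0.2),
  linear     = (1:6) / 21,
  two_point  = c(3/pi, 1 - 3/pi, 0, 0, 0, 0)
)

# P(A) = sum of p_i over the elements x_i in A, as in (1.3.3)
prob <- function(p, A) sum(p[A])

A <- c(1, 6)
totals <- sapply(assignments, sum)          # condition (1.3.2): each sums to 1
P_A    <- sapply(assignments, prob, A = A)  # P(A) under each assignment
```

Here `totals` confirms that each assignment satisfies (1.3.2), and `P_A` holds the four values of P(A) for A = {1, 6}.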


Definition 1.3.2 (Equilikely Case). Let $\mathcal{C} = \{x_1, x_2, \ldots, x_m\}$ be a finite sample
space. Let $p_i = 1/m$ for all $i = 1, 2, \ldots, m$ and for all subsets $A$ of $\mathcal{C}$ define

$$P(A) = \sum_{x_i \in A} \frac{1}{m} = \frac{\#(A)}{m},$$

where $\#(A)$ denotes the number of elements in $A$. Then $P$ is a probability on $\mathcal{C}$ and
it is referred to as the equilikely case. ■


Equilikely cases are frequently probability models of interest. Examples include: 
the flip of a fair coin; five cards drawn from a well shuffled deck of 52 cards; a spin of 
a fair spinner with the numbers 1 through 36 on it; and the upfaces of the roll of a 
pair of balanced dice. For each of these experiments, as stated in the definition, we 
only need to know the number of elements in an event to compute the probability 
of that event. For example, a card player may be interested in the probability of 
getting a pair (two of a kind) in a hand of five cards dealt from a well shuffled deck 
of 52 cards. To compute this probability, we need to know the number of five card 
hands and the number of such hands which contain a pair. Because the equilikely 
case is often of interest, we next develop some counting rules which can be used to 
compute the probabilities of events of interest.

1.3.1 Counting Rules


We discuss three counting rules that are usually discussed in an elementary algebra 
course. 

The first rule is called the mn-rule (m times n-rule), which is also called the 
multiplication rule. Let $A = \{x_1, x_2, \ldots, x_m\}$ be a set of $m$ elements and let
$B = \{y_1, y_2, \ldots, y_n\}$ be a set of $n$ elements. Then there are $mn$ ordered pairs,
$(x_i, y_j)$, $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n$, of elements, the first from $A$ and the
second from $B$. Informally, we often speak of ways, here. For example, there are five
roads (ways) between cities I and II and there are ten roads (ways) between cities
II and III. Hence, there are $5 \cdot 10 = 50$ ways to get from city I to city III by going
from city I to city II and then from city II to city III. This rule extends immediately
to more than two sets. For instance, suppose in a certain state that driver license
plates have the pattern of three letters followed by three numbers. Then there are
$26^3 \cdot 10^3$ possible license plates in this state.

Next, let $A$ be a set with $n$ elements. Suppose we are interested in $k$-tuples
whose components are elements of $A$. Then by the extended $mn$ rule, there are
$n \cdot n \cdots n = n^k$ such $k$-tuples whose components are elements of $A$. Next, suppose
$k \le n$ and we are interested in $k$-tuples whose components are distinct (no repeats)
elements of $A$. There are $n$ elements from which to choose for the first component,
$n - 1$ for the second component, ..., $n - (k - 1)$ for the $k$th. Hence, by the $mn$ rule,
there are $n(n-1)\cdots(n-(k-1))$ such $k$-tuples with distinct elements. We call
each such $k$-tuple a permutation and use the symbol $P_k^n$ to denote the number of
$k$ permutations taken from a set of $n$ elements. This number of permutations, $P_k^n$,
is our second counting rule. We can rewrite it as

$$P_k^n = n(n-1)\cdots(n-(k-1)) = \frac{n!}{(n-k)!}. \quad (1.3.5)$$
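
The first two counting rules are easy to evaluate numerically. The following is a minimal sketch in base R; `perm_count` is a helper name chosen here for illustration (it is not a base R function), and the values n = 5, k = 3 are arbitrary.

```r
# Number of k-permutations of n elements, P_k^n = n(n-1)...(n-(k-1)).
# perm_count is a hypothetical helper name, not part of base R.
perm_count <- function(n, k) {
  stopifnot(k >= 0, k <= n)
  if (k == 0) return(1)          # the empty tuple
  prod(n:(n - k + 1))            # product of the k largest factors of n!
}

plates <- 26^3 * 10^3                        # license plates, extended mn-rule
p_5_3  <- perm_count(5, 3)                   # 5 * 4 * 3
check  <- factorial(5) / factorial(5 - 3)    # n!/(n-k)!, expression (1.3.5)
```

The product form avoids computing large factorials, which matters once n grows (for example n = 365 in the birthday problem).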


Example 1.3.3 (Birthday Problem). Suppose there are $n$ people in a room. Assume
that $n < 365$ and that the people are unrelated in any way. Find the probability
of the event $A$ that at least 2 people have the same birthday. For convenience,
assign the numbers 1 through $n$ to the people in the room. Then use $n$-tuples to
denote the birthdays of the first person through the $n$th person in the room. Using
the $mn$-rule, there are $365^n$ possible birthday $n$-tuples for these $n$ people. This
is the number of elements in the sample space. Now assume that birthdays are
equilikely to occur on any of the 365 days. Hence, each of these $n$-tuples has probability
$365^{-n}$. Notice that the complement of $A$ is the event that all the birthdays
in the room are distinct; that is, the number of $n$-tuples in $A^c$ is $P_n^{365}$. Thus, the
probability of $A$ is

$$P(A) = 1 - \frac{P_n^{365}}{365^n}.$$

For instance, if $n = 2$ then $P(A) = 1 - (365 \cdot 364)/365^2 = 0.0027$. This formula
is not easy to compute by hand. The following R function computes $P(A)$ for
the input $n$ and it can be downloaded at the sites mentioned in the Preface.


Footnote 4: An R primer for the course is found in Appendix B.

bday = function(n){
  # P(at least two of n people share a birthday), 365 equilikely days
  bday = 1; nm1 = n - 1
  for(j in seq_len(nm1)){bday = bday*((365-j)/365) }
  bday <- 1 - bday; return(bday)}


Assuming that the file bday.R contains this function, here is the R segment com- 
puting P(A) for n = 10: 

> source("bday.R") 

> bday (10) 

[1] 0.1169482 
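
The loop in bday can also be written as a single product, which mirrors the formula $P(A) = 1 - P_n^{365}/365^n$ term by term. The following is a minimal sketch; the name `bday2` is chosen here only to avoid clashing with the function from the text.

```r
# Birthday probability via 1 - P_n^365 / 365^n, written as a product of ratios.
# bday2 is a hypothetical name; it is not the function distributed with the text.
bday2 <- function(n) {
  stopifnot(n >= 1, n <= 365)
  # factors 365/365, 364/365, ..., (365 - n + 1)/365
  1 - prod((365 - seq_len(n) + 1) / 365)
}

p2  <- bday2(2)     # compare with the value 0.0027 computed in the example
p10 <- bday2(10)    # compare with the R session output for bday(10)
```

Multiplying the ratios one at a time, rather than dividing $P_n^{365}$ by $365^n$, keeps every intermediate value between 0 and 1 and so avoids overflow for larger n.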


For our last counting rule, as with permutations, we are drawing from a set $A$
of $n$ elements. Now, suppose order is not important, so instead of counting the
number of permutations we want to count the number of subsets of $k$ elements
taken from $A$. We use the symbol $\binom{n}{k}$ to denote the total number of these subsets.
Consider a subset of $k$ elements from $A$. By the permutation rule it generates
$P_k^k = k(k-1)\cdots 1 = k!$ permutations. Furthermore, all these permutations are
distinct from the permutations generated by other subsets of $k$ elements from $A$.
Finally, each permutation of $k$ distinct elements drawn from $A$ must be generated
by one of these subsets. Hence, we have shown that $P_k^n = \binom{n}{k} k!$; that is,

$$\binom{n}{k} = \frac{n!}{k!(n-k)!}. \quad (1.3.6)$$

We often use the terminology combinations instead of subsets. So we say that there
are $\binom{n}{k}$ combinations of $k$ things taken from a set of $n$ things. Another common
symbol for $\binom{n}{k}$ is $C_k^n$.

It is interesting to note that if we expand the binomial series,

$$(a+b)^n = (a+b)(a+b)\cdots(a+b),$$

we get

$$(a+b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}, \quad (1.3.7)$$

because we can select the $k$ factors from which to take $a$ in $\binom{n}{k}$ ways. So $\binom{n}{k}$ is also
referred to as a binomial coefficient.
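
Base R provides `choose` for binomial coefficients and `combn` for enumerating subsets, so (1.3.6) and (1.3.7) can be checked numerically. The following is a minimal sketch; the values of `n`, `k`, `a`, and `b` are arbitrary illustrative choices.

```r
n <- 5; k <- 2
by_formula <- factorial(n) / (factorial(k) * factorial(n - k))  # expression (1.3.6)
by_choose  <- choose(n, k)                                      # base R binomial coefficient
n_subsets  <- ncol(combn(n, k))    # list all k-subsets of {1,...,n} and count them

# Binomial expansion (1.3.7) checked at arbitrary values of a and b.
a <- 2; b <- 3
ks <- 0:n
expansion <- sum(choose(n, ks) * a^ks * b^(n - ks))
```

The three counts agree, and `expansion` matches `(a + b)^n` up to floating point error.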


Example 1.3.4 (Poker Hands). Let a card be drawn at random from an ordinary
deck of 52 playing cards that has been well shuffled. The sample space $\mathcal{C}$ consists of
52 elements, each of which represents one and only one of the 52 cards. Because the
deck has been well shuffled, it is reasonable to assume that each of these outcomes
has the same probability $\frac{1}{52}$. Accordingly, if $E_1$ is the set of outcomes that are
spades, $P(E_1) = \frac{13}{52} = \frac{1}{4}$ because there are 13 spades in the deck; that is, $\frac{1}{4}$ is the
probability of drawing a card that is a spade. If $E_2$ is the set of outcomes that
are kings, $P(E_2) = \frac{4}{52} = \frac{1}{13}$ because there are 4 kings in the deck; that is, $\frac{1}{13}$ is
the probability of drawing a card that is a king. These computations are very easy
because there are no difficulties in the determination of the number of elements in
each event.

However, instead of drawing only one card, suppose that five cards are taken,
at random and without replacement, from this deck; i.e., a five card poker hand. In
this instance, order is not important. So a hand is a subset of five elements drawn
from a set of 52 elements. Hence, by (1.3.6) there are $\binom{52}{5}$ poker hands. If the
deck is well shuffled, each hand should be equilikely; i.e., each hand has probability
$1/\binom{52}{5}$. We can now compute the probabilities of some interesting poker hands. Let
$E_1$ be the event of a flush, all five cards of the same suit. There are $\binom{4}{1} = 4$ suits
to choose for the flush and in each suit there are $\binom{13}{5}$ possible hands; hence, using
the multiplication rule, the probability of getting a flush is

$$P(E_1) = \frac{\binom{4}{1}\binom{13}{5}}{\binom{52}{5}} = \frac{4 \cdot 1287}{2598960} = 0.00198.$$

Real poker players note that this includes the probability of obtaining a straight
flush.

Next, consider the probability of the event $E_2$ of getting exactly three of a kind
(the other two cards are distinct and are of different kinds). Choose the kind for
the three, in $\binom{13}{1}$ ways; choose the three, in $\binom{4}{3}$ ways; choose the other two kinds,
in $\binom{12}{2}$ ways; and choose one card from each of these last two kinds, in $\binom{4}{1}\binom{4}{1}$ ways.
Hence the probability of exactly three of a kind is

$$P(E_2) = \frac{\binom{13}{1}\binom{4}{3}\binom{12}{2}\binom{4}{1}^2}{\binom{52}{5}} = 0.0211.$$

Now suppose that $E_3$ is the set of outcomes in which exactly three cards are
kings and exactly two cards are queens. Select the kings, in $\binom{4}{3}$ ways, and select
the queens, in $\binom{4}{2}$ ways. Hence, the probability of $E_3$ is

$$P(E_3) = \binom{4}{3}\binom{4}{2} \Big/ \binom{52}{5} = 0.0000092.$$

The event $E_3$ is an example of a full house: three of one kind and two of another
kind. Exercise 1.3.19 asks for the determination of the probability of a full house. ■
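
The three poker probabilities in Example 1.3.4 follow the same pattern: count the favorable hands with the multiplication rule and divide by the number of five card hands. The following is a minimal sketch using base R's `choose`; the variable names are illustrative.

```r
hands <- choose(52, 5)     # number of five card hands

# Flush: choose the suit, then five cards from that suit
p_flush <- choose(4, 1) * choose(13, 5) / hands

# Exactly three of a kind: kind for the three, the three cards,
# the other two kinds, and one card from each of those kinds
p_three <- choose(13, 1) * choose(4, 3) * choose(12, 2) * choose(4, 1)^2 / hands

# Three kings and two queens
p_kkkqq <- choose(4, 3) * choose(4, 2) / hands
```

Because every hand is equilikely, each probability is just a ratio of two counts, as in Definition 1.3.2.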


1.3.2 Additional Properties of Probability 


We end this section with several additional properties of probability which prove 
useful in the sequel. Recall in Exercise 1.2.6 we said that a sequence of events
$\{C_n\}$ is a nondecreasing sequence if $C_n \subset C_{n+1}$, for all $n$, in which case we wrote
$\lim_{n\to\infty} C_n = \bigcup_{n=1}^{\infty} C_n$. Consider $\lim_{n\to\infty} P(C_n)$. The question is: can we legitimately
interchange the limit and $P$? As the following theorem shows, the answer
is yes. The result also holds for a decreasing sequence of events. Because of this
interchange, this theorem is sometimes referred to as the continuity theorem of
probability.

Theorem 1.3.6. Let $\{C_n\}$ be a nondecreasing sequence of events. Then

$$\lim_{n\to\infty} P(C_n) = P\left(\lim_{n\to\infty} C_n\right) = P\left(\bigcup_{n=1}^{\infty} C_n\right). \quad (1.3.8)$$

Let $\{C_n\}$ be a decreasing sequence of events. Then

$$\lim_{n\to\infty} P(C_n) = P\left(\lim_{n\to\infty} C_n\right) = P\left(\bigcap_{n=1}^{\infty} C_n\right). \quad (1.3.9)$$


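Proof. We prove the result (1.3.8) and leave the second result as Exercise 1.3.20.
Define the sets, called rings, as $R_1 = C_1$ and, for $n > 1$, $R_n = C_n \cap C_{n-1}^c$. It
follows that $\bigcup_{n=1}^{\infty} C_n = \bigcup_{n=1}^{\infty} R_n$ and that $R_m \cap R_n = \phi$, for $m \ne n$. Also,
$P(R_n) = P(C_n) - P(C_{n-1})$. Applying the third axiom of probability yields the
following string of equalities:

$$P\left[\lim_{n\to\infty} C_n\right] = P\left(\bigcup_{n=1}^{\infty} C_n\right) = P\left(\bigcup_{n=1}^{\infty} R_n\right) = \sum_{n=1}^{\infty} P(R_n) = \lim_{n\to\infty} \sum_{j=1}^{n} P(R_j)$$
$$= \lim_{n\to\infty} \left\{ P(C_1) + \sum_{j=2}^{n} [P(C_j) - P(C_{j-1})] \right\} = \lim_{n\to\infty} P(C_n). \quad (1.3.10)$$

This is the desired result. ■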


Another useful result for arbitrary unions is given by 


Theorem 1.3.7 (Boole's Inequality). Let $\{C_n\}$ be an arbitrary sequence of events.
Then

$$P\left(\bigcup_{n=1}^{\infty} C_n\right) \le \sum_{n=1}^{\infty} P(C_n). \quad (1.3.11)$$


Proof: Let $D_n = \bigcup_{i=1}^{n} C_i$. Then $\{D_n\}$ is an increasing sequence of events that go
up to $\bigcup_{n=1}^{\infty} C_n$. Also, for all $j$, $D_j = D_{j-1} \cup C_j$. Hence, by Theorem 1.3.5,

$$P(D_j) \le P(D_{j-1}) + P(C_j),$$

that is,

$$P(D_j) - P(D_{j-1}) \le P(C_j).$$

In this case, the $C_i$s are replaced by the $D_i$s in expression (1.3.10). Hence, using
the above inequality in this expression and the fact that $P(C_1) = P(D_1)$, we have

$$P\left(\bigcup_{n=1}^{\infty} C_n\right) = P\left(\bigcup_{n=1}^{\infty} D_n\right) = \lim_{n\to\infty} \left\{ P(D_1) + \sum_{j=2}^{n} [P(D_j) - P(D_{j-1})] \right\}$$
$$\le \lim_{n\to\infty} \sum_{j=1}^{n} P(C_j) = \sum_{n=1}^{\infty} P(C_n). \; ■$$

Theorem 1.3.5 gave a general additive law of probability for the union of two
events. As the next remark shows, this can be extended to an additive law for an 
arbitrary union. 


Remark 1.3.2 (Inclusion Exclusion Formula). It is easy to show (Exercise 1.3.9)
that
$$P(C_1 \cup C_2 \cup C_3) = p_1 - p_2 + p_3,$$
where
$$p_1 = P(C_1) + P(C_2) + P(C_3)$$
$$p_2 = P(C_1 \cap C_2) + P(C_1 \cap C_3) + P(C_2 \cap C_3)$$
$$p_3 = P(C_1 \cap C_2 \cap C_3). \quad (1.3.12)$$

This can be generalized to the inclusion exclusion formula:

$$P(C_1 \cup C_2 \cup \cdots \cup C_k) = p_1 - p_2 + p_3 - \cdots + (-1)^{k+1} p_k, \quad (1.3.13)$$

where $p_i$ equals the sum of the probabilities of all possible intersections involving $i$
sets.

When $k = 3$, it follows that $p_1 \ge p_2 \ge p_3$, but more generally $p_1 \ge p_2 \ge \cdots \ge p_k$.
As shown in Theorem 1.3.7,

$$p_1 = P(C_1) + P(C_2) + \cdots + P(C_k) \ge P(C_1 \cup C_2 \cup \cdots \cup C_k).$$

For $k = 2$, we have
$$1 \ge P(C_1 \cup C_2) = P(C_1) + P(C_2) - P(C_1 \cap C_2),$$
which gives Bonferroni's inequality,
$$P(C_1 \cap C_2) \ge P(C_1) + P(C_2) - 1, \quad (1.3.14)$$

that is only useful when $P(C_1)$ and $P(C_2)$ are large. The inclusion exclusion formula
provides other inequalities that are useful, such as

$$p_1 \ge P(C_1 \cup C_2 \cup \cdots \cup C_k) \ge p_1 - p_2$$

and
$$p_1 - p_2 + p_3 \ge P(C_1 \cup C_2 \cup \cdots \cup C_k) \ge p_1 - p_2 + p_3 - p_4. \; ■$$
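
The three-event formula (1.3.12) and Bonferroni's inequality (1.3.14) can be checked by counting on a small equilikely space. The following is a minimal sketch, assuming one roll of a fair die; the events `C1`, `C2`, `C3` and the helper `P` are chosen here only for illustration.

```r
# One roll of a fair die; three overlapping events chosen for illustration.
P  <- function(E) length(E) / 6
C1 <- c(1, 2, 3); C2 <- c(2, 3, 4); C3 <- c(3, 4, 5)

p1 <- P(C1) + P(C2) + P(C3)
p2 <- P(intersect(C1, C2)) + P(intersect(C1, C3)) + P(intersect(C2, C3))
p3 <- P(Reduce(intersect, list(C1, C2, C3)))

direct    <- P(Reduce(union, list(C1, C2, C3)))   # P(C1 U C2 U C3) by counting
incl_excl <- p1 - p2 + p3                         # expression (1.3.12)

bonferroni_lower <- P(C1) + P(C2) - 1             # lower bound in (1.3.14)
```

Here `direct` and `incl_excl` agree, `p1` bounds the union from above as in Boole's inequality, and `bonferroni_lower` does not exceed `P(intersect(C1, C2))`.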


EXERCISES 


1.3.1. A positive integer from one to six is to be chosen by casting a die. Thus the
elements $c$ of the sample space $\mathcal{C}$ are 1, 2, 3, 4, 5, 6. Suppose $C_1 = \{1, 2, 3, 4\}$ and
$C_2 = \{3, 4, 5, 6\}$. If the probability set function $P$ assigns a probability of $\frac{1}{6}$ to each
of the elements of $\mathcal{C}$, compute $P(C_1)$, $P(C_2)$, $P(C_1 \cap C_2)$, and $P(C_1 \cup C_2)$.


1.3.2. A random experiment consists of drawing a card from an ordinary deck of
52 playing cards. Let the probability set function $P$ assign a probability of $\frac{1}{52}$ to
each of the 52 possible outcomes. Let $C_1$ denote the collection of the 13 hearts and
let $C_2$ denote the collection of the 4 kings. Compute $P(C_1)$, $P(C_2)$, $P(C_1 \cap C_2)$,
and $P(C_1 \cup C_2)$.


1.3.3. A coin is to be tossed as many times as necessary to turn up one head.
Thus the elements $c$ of the sample space $\mathcal{C}$ are H, TH, TTH, TTTH, and so
forth. Let the probability set function $P$ assign to these elements the respective
probabilities $\frac{1}{2}$, $\frac{1}{4}$, $\frac{1}{8}$, $\frac{1}{16}$, and so forth. Show that $P(\mathcal{C}) = 1$. Let
$C_1 = \{c : c \text{ is H, TH, TTH, TTTH, or TTTTH}\}$. Compute $P(C_1)$. Next, suppose that
$C_2 = \{c : c \text{ is TTTTH or TTTTTH}\}$. Compute $P(C_2)$, $P(C_1 \cap C_2)$, and $P(C_1 \cup C_2)$.


1.3.4. If the sample space is $\mathcal{C} = C_1 \cup C_2$ and if $P(C_1) = 0.8$ and $P(C_2) = 0.5$, find
$P(C_1 \cap C_2)$.


1.3.5. Let the sample space be $\mathcal{C} = \{c : 0 < c < \infty\}$. Let $C \subset \mathcal{C}$ be defined by
$C = \{c : 4 < c < \infty\}$ and take $P(C) = \int_C e^{-x}\,dx$. Show that $P(\mathcal{C}) = 1$. Evaluate
$P(C)$, $P(C^c)$, and $P(C \cup C^c)$.


1.3.6. If the sample space is $\mathcal{C} = \{c : -\infty < c < \infty\}$ and if $C \subset \mathcal{C}$ is a set for which
the integral $\int_C e^{-|x|}\,dx$ exists, show that this set function is not a probability set
function. What constant do we multiply the integrand by to make it a probability 
set function? 


1.3.7. If $C_1$ and $C_2$ are subsets of the sample space $\mathcal{C}$, show that
$$P(C_1 \cap C_2) \le P(C_1) \le P(C_1 \cup C_2) \le P(C_1) + P(C_2).$$


1.3.8. Let $C_1$, $C_2$, and $C_3$ be three mutually disjoint subsets of the sample space
$\mathcal{C}$. Find $P[(C_1 \cup C_2) \cap C_3]$ and $P(C_1^c \cup C_2^c)$.


1.3.9. Consider Remark 1.3.2. 


(a) If $C_1$, $C_2$, and $C_3$ are subsets of $\mathcal{C}$, show that

$$P(C_1 \cup C_2 \cup C_3) = P(C_1) + P(C_2) + P(C_3) - P(C_1 \cap C_2)$$
$$- P(C_1 \cap C_3) - P(C_2 \cap C_3) + P(C_1 \cap C_2 \cap C_3).$$

(b) Now prove the general inclusion exclusion formula given by the expression
(1.3.13).


Remark 1.3.3. In order to solve Exercises (1.3.10)-(1.3.19), certain reasonable
assumptions must be made. ■


1.3.10. A bowl contains 16 chips, of which 6 are red, 7 are white, and 3 are blue. If 
four chips are taken at random and without replacement, find the probability that: 
(a) each of the four chips is red; (b) none of the four chips is red; (c) there is at 
least one chip of each color.

1.3.11. A person has purchased 10 of 1000 tickets sold in a certain raffle. To
determine the five prize winners, five tickets are to be drawn at random and without 
replacement. Compute the probability that this person wins at least one prize. 
Hint: First compute the probability that the person does not win a prize. 


1.3.12. Compute the probability of being dealt at random and without replacement 
a 13-card bridge hand consisting of: (a) 6 spades, 4 hearts, 2 diamonds, and 1 club; 
(b) 13 cards of the same suit. 


1.3.13. Three distinct integers are chosen at random from the first 20 positive 
integers. Compute the probability that: (a) their sum is even; (b) their product is 
even. 


1.3.14. There are five red chips and three blue chips in a bowl. The red chips 
are numbered 1, 2, 3, 4, 5, respectively, and the blue chips are numbered 1, 2, 3, 
respectively. If two chips are to be drawn at random and without replacement, find 
the probability that these chips have either the same number or the same color. 


1.3.15. In a lot of 50 light bulbs, there are 2 bad bulbs. An inspector examines 
five bulbs, which are selected at random and without replacement. 


(a) Find the probability of at least one defective bulb among the five. 


(b) How many bulbs should be examined so that the probability of finding at least 
one bad bulb exceeds $\frac{1}{2}$?


1.3.16. For the birthday problem, Example 1.3.3, use the given R function bday to 
determine the value of n so that p(n) > 0.5 and p(n — 1) < 0.5, where p(n) is the 
probability that at least two people in the room of n people have the same birthday. 


1.3.17. If $C_1, \ldots, C_k$ are $k$ events in the sample space $\mathcal{C}$, show that the probability
that at least one of the events occurs is one minus the probability that none of them
occur; i.e.,

$$P(C_1 \cup \cdots \cup C_k) = 1 - P(C_1^c \cap \cdots \cap C_k^c). \quad (1.3.15)$$


1.3.18. A secretary types three letters and the three corresponding envelopes. In 
a hurry, he places at random one letter in each envelope. What is the probability 
that at least one letter is in the correct envelope? Hint: Let $C_i$ be the event that
the $i$th letter is in the correct envelope. Expand $P(C_1 \cup C_2 \cup C_3)$ to determine the
probability. 


1.3.19. Consider poker hands drawn from a well-shuffled deck as described in
Example 1.3.4. Determine the probability of a full house, i.e., three of one kind and
two of another.


1.3.20. Prove expression (1.3.9). 


1.3.21. Suppose the experiment is to choose a real number at random in the
interval (0, 1). For any subinterval $(a, b) \subset (0, 1)$, it seems reasonable to assign the
probability $P[(a, b)] = b - a$; i.e., the probability of selecting the point from a
subinterval is directly proportional to the length of the subinterval. If this is the case,
choose an appropriate sequence of subintervals and use expression (1.3.9) to show
that $P[\{a\}] = 0$, for all $a \in (0, 1)$.

1.3.22. Consider the events $C_1, C_2, C_3$.


(a) Suppose $C_1, C_2, C_3$ are mutually exclusive events. If $P(C_i) = p_i$, $i = 1, 2, 3$,
what is the restriction on the sum $p_1 + p_2 + p_3$?


(b) In the notation of part (a), if $p_1 = 4/10$, $p_2 = 3/10$, and $p_3 = 5/10$, are
$C_1, C_2, C_3$ mutually exclusive?


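For the last two exercises it is assumed that the reader is familiar with $\sigma$-fields.


1.3.23. Suppose $\mathcal{D}$ is a nonempty collection of subsets of $\mathcal{C}$. Consider the collection
of events
$$\mathcal{B} = \cap\{\mathcal{E} : \mathcal{D} \subset \mathcal{E} \text{ and } \mathcal{E} \text{ is a } \sigma\text{-field}\}.$$

Note that $\phi \in \mathcal{B}$ because it is in each $\sigma$-field, and, hence, in particular, it is in each
$\sigma$-field $\mathcal{E} \supset \mathcal{D}$. Continue in this way to show that $\mathcal{B}$ is a $\sigma$-field.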


1.3.24. Let $\mathcal{C} = R$, where $R$ is the set of all real numbers. Let $\mathcal{I}$ be the set of all
open intervals in $R$. The Borel $\sigma$-field on the real line is given by

$$\mathcal{B}_0 = \cap\{\mathcal{E} : \mathcal{I} \subset \mathcal{E} \text{ and } \mathcal{E} \text{ is a } \sigma\text{-field}\}.$$

By definition, $\mathcal{B}_0$ contains the open intervals. Because $[a, \infty) = (-\infty, a)^c$ and $\mathcal{B}_0$
is closed under complements, it contains all intervals of the form $[a, \infty)$, for $a \in R$.
Continue in this way and show that $\mathcal{B}_0$ contains all the closed and half-open intervals
of real numbers.


1.4 Conditional Probability and Independence 


In some random experiments, we are interested only in those outcomes that are 
elements of a subset A of the sample space C. This means, for our purposes, that 
the sample space is effectively the subset A. We are now confronted with the 
problem of defining a probability set function with A as the “new” sample space. 

Let the probability set function P(A) be defined on the sample space C and let 
A be a subset of C such that P(A) > 0. We agree to consider only those outcomes 
of the random experiment that are elements of A; in essence, then, we take A 
to be a sample space. Let B be another subset of C. How, relative to the new 
sample space A, do we want to define the probability of the event B? Once defined, 
this probability is called the conditional probability of the event B, relative to the 
hypothesis of the event A, or, more briefly, the conditional probability of B, given 
A. Such a conditional probability is denoted by the symbol P(B|A). The “|” in this 
symbol is usually read as “given.” We now return to the question that was raised 
about the definition of this symbol. Since A is now the sample space, the only 
elements of B that concern us are those, if any, that are also elements of A, that 
is, the elements of $A \cap B$. It seems desirable, then, to define the symbol $P(B|A)$ in
such a way that

$$P(A|A) = 1 \quad \text{and} \quad P(B|A) = P(A \cap B|A).$$

Moreover, from a relative frequency point of view, it would seem logically inconsistent
if we did not require that the ratio of the probabilities of the events $A \cap B$
and $A$, relative to the space $A$, be the same as the ratio of the probabilities of these
events relative to the space $\mathcal{C}$; that is, we should have

$$\frac{P(A \cap B|A)}{P(A|A)} = \frac{P(A \cap B)}{P(A)}.$$

These three desirable conditions imply that conditional probability is
reasonably defined as follows.


Definition 1.4.1 (Conditional Probability). Let $B$ and $A$ be events with $P(A) > 0$.
Then we define the conditional probability of $B$ given $A$ as

$$P(B|A) = \frac{P(A \cap B)}{P(A)}. \quad (1.4.1)$$

Moreover, we have

1. $P(B|A) \ge 0$.

2. $P(A|A) = 1$.

3. $P(\bigcup_{n=1}^{\infty} B_n | A) = \sum_{n=1}^{\infty} P(B_n|A)$, provided that $B_1, B_2, \ldots$ are mutually
exclusive events.


Properties (1) and (2) are evident. For Property (3), suppose the sequence of
events $B_1, B_2, \ldots$ is mutually exclusive. It follows that also $(B_n \cap A) \cap (B_m \cap A) = \phi$,
$n \ne m$. Using this and the first of the distributive laws (1.2.5) for countable unions,
we have

$$P\left(\bigcup_{n=1}^{\infty} B_n \,\Big|\, A\right) = \frac{P\left[\bigcup_{n=1}^{\infty} (B_n \cap A)\right]}{P(A)} = \sum_{n=1}^{\infty} \frac{P[B_n \cap A]}{P(A)} = \sum_{n=1}^{\infty} P[B_n|A].$$


Properties (1)—(3) are precisely the conditions that a probability set function must 
satisfy. Accordingly, P(B|A) is a probability set function, defined for subsets of A. 
It may be called the conditional probability set function, relative to the hypothesis 
A, or the conditional probability set function, given A. It should be noted that 
this conditional probability set function, given A, is defined at this time only when 
P(A) > 0. 


Example 1.4.1. A hand of five cards is to be dealt at random without replacement
from an ordinary deck of 52 playing cards. The conditional probability of an
all-spade hand ($B$), relative to the hypothesis that there are at least four spades in the
hand ($A$), is, since $A \cap B = B$,

$$P(B|A) = \frac{P(B)}{P(A)} = \frac{\binom{13}{5} / \binom{52}{5}}{\left[\binom{13}{4}\binom{39}{1} + \binom{13}{5}\right] / \binom{52}{5}} = \frac{\binom{13}{5}}{\binom{13}{4}\binom{39}{1} + \binom{13}{5}} = 0.0441.$$


Note that this is not the same as drawing for a spade to complete a flush in draw 
poker; see Exercise 1.4.3. 
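
The conditional probability in Example 1.4.1 can be reproduced from Definition 1.4.1 with base R's `choose`. The following is a minimal sketch; the variable names are illustrative.

```r
hands <- choose(52, 5)

# B: all five cards are spades
P_B <- choose(13, 5) / hands

# A: at least four spades (exactly four spades and one non-spade, or five spades)
P_A <- (choose(13, 4) * choose(39, 1) + choose(13, 5)) / hands

# B is a subset of A, so P(A and B) = P(B) and P(B | A) = P(B) / P(A)
P_B_given_A <- P_B / P_A
```

The common factor `hands` cancels in the ratio, which is why the example reduces to a ratio of counts.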


From the definition of the conditional probability set function, we observe that 
$$P(A \cap B) = P(A)P(B|A).$$


This relation is frequently called the multiplication rule for probabilities. Some- 
times, after considering the nature of the random experiment, it is possible to make 
reasonable assumptions so that both P(A) and P(B|A) can be assigned. Then 
$P(A \cap B)$ can be computed under these assumptions. This is illustrated in
Examples 1.4.2 and 1.4.3.


Example 1.4.2. A bowl contains eight chips. Three of the chips are red and 
the remaining five are blue. Two chips are to be drawn successively, at random 
and without replacement. We want to compute the probability that the first draw 
results in a red chip (A) and that the second draw results in a blue chip (B). It is 
reasonable to assign the following probabilities: 


$$P(A) = \frac{3}{8} \quad \text{and} \quad P(B|A) = \frac{5}{7}.$$

Thus, under these assignments, we have $P(A \cap B) = (\frac{3}{8})(\frac{5}{7}) = \frac{15}{56} = 0.2679$. ■


Example 1.4.3. From an ordinary deck of playing cards, cards are to be drawn 
successively, at random and without replacement. The probability that the third 
spade appears on the sixth draw is computed as follows. Let A be the event of two 
spades in the first five draws and let B be the event of a spade on the sixth draw. 
Thus the probability that we wish to compute is $P(A \cap B)$. It is reasonable to take

$$P(A) = \frac{\binom{13}{2}\binom{39}{3}}{\binom{52}{5}} = 0.2743 \quad \text{and} \quad P(B|A) = \frac{11}{47} = 0.2340.$$

The desired probability $P(A \cap B)$ is then the product of these two numbers, which
to four places is 0.0642. ■
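
The multiplication rule computations in Examples 1.4.2 and 1.4.3 take one line each in R. The following is a minimal sketch; the variable names are illustrative.

```r
# Example 1.4.2: red on the first draw, then blue on the second.
p_red_blue <- (3/8) * (5/7)                 # P(A) * P(B | A)

# Example 1.4.3: the third spade appears on the sixth draw.
P_A <- choose(13, 2) * choose(39, 3) / choose(52, 5)   # two spades in the first five draws
P_B_given_A <- 11 / 47                                 # 11 spades remain among 47 cards
p_third_spade_sixth <- P_A * P_B_given_A
```

In both cases the conditional probability is assigned directly from the composition of the bowl or deck after the earlier draws, and the joint probability follows as a product.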


The multiplication rule can be extended to three or more events. In the case of 
three events, we have, by using the multiplication rule for two events, 


$$P(A \cap B \cap C) = P[(A \cap B) \cap C] = P(A \cap B)P(C|A \cap B).$$

But $P(A \cap B) = P(A)P(B|A)$. Hence, provided $P(A \cap B) > 0$,
$$P(A \cap B \cap C) = P(A)P(B|A)P(C|A \cap B).$$


This procedure can be used to extend the multiplication rule to four or more 
events. The general formula for k events can be proved by mathematical induction. 


Example 1.4.4. Four cards are to be dealt successively, at random and without 
replacement, from an ordinary deck of playing cards. The probability of receiving a 
spade, a heart, a diamond, and a club, in that order, is $(\frac{13}{52})(\frac{13}{51})(\frac{13}{50})(\frac{13}{49}) = 0.0044$.
This follows from the extension of the multiplication rule. 


Consider $k$ mutually exclusive and exhaustive events $A_1, A_2, \ldots, A_k$ such that
$P(A_i) > 0$, $i = 1, 2, \ldots, k$; i.e., $A_1, A_2, \ldots, A_k$ form a partition of $\mathcal{C}$. Here the events
$A_1, A_2, \ldots, A_k$ do not need to be equally likely. Let $B$ be another event such that
$P(B) > 0$. Thus $B$ occurs with one and only one of the events $A_1, A_2, \ldots, A_k$; that
is,

$$B = B \cap (A_1 \cup A_2 \cup \cdots \cup A_k) = (B \cap A_1) \cup (B \cap A_2) \cup \cdots \cup (B \cap A_k).$$

Since $B \cap A_i$, $i = 1, 2, \ldots, k$, are mutually exclusive, we have
$$P(B) = P(B \cap A_1) + P(B \cap A_2) + \cdots + P(B \cap A_k).$$
However, $P(B \cap A_i) = P(A_i)P(B|A_i)$, $i = 1, 2, \ldots, k$; so
$$P(B) = P(A_1)P(B|A_1) + P(A_2)P(B|A_2) + \cdots + P(A_k)P(B|A_k) = \sum_{i=1}^{k} P(A_i)P(B|A_i). \quad (1.4.2)$$


This result is sometimes called the law of total probability and it leads to the 
following important theorem. 


Theorem 1.4.1 (Bayes). Let $A_1, A_2, \ldots, A_k$ be events such that $P(A_i) > 0$,
$i = 1, 2, \ldots, k$. Assume further that $A_1, A_2, \ldots, A_k$ form a partition of the sample space
$\mathcal{C}$. Let $B$ be any event. Then

$$P(A_j|B) = \frac{P(A_j)P(B|A_j)}{\sum_{i=1}^{k} P(A_i)P(B|A_i)}. \quad (1.4.3)$$

Proof: Based on the definition of conditional probability, we have

$$P(A_j|B) = \frac{P(B \cap A_j)}{P(B)} = \frac{P(A_j)P(B|A_j)}{P(B)}.$$

The result then follows by the law of total probability, (1.4.2). ■

This theorem is the well-known Bayes' Theorem. This permits us to calculate
the conditional probability of $A_j$, given $B$, from the probabilities of $A_1, A_2, \ldots, A_k$
and the conditional probabilities of $B$, given $A_i$, $i = 1, 2, \ldots, k$. The next three
examples illustrate the usefulness of Bayes' Theorem to determine probabilities.


Example 1.4.5. Say it is known that bowl $A_1$ contains three red and seven blue
chips and bowl $A_2$ contains eight red and two blue chips. All chips are identical
in size and shape. A die is cast and bowl $A_1$ is selected if five or six spots show
on the side that is up; otherwise, bowl $A_2$ is selected. For this situation, it seems
reasonable to assign $P(A_1) = \frac{2}{6}$ and $P(A_2) = \frac{4}{6}$. The selected bowl is handed to
another person and one chip is taken at random. Say that this chip is red, an event
which we denote by $B$. By considering the contents of the bowls, it is reasonable
to assign the conditional probabilities $P(B|A_1) = \frac{3}{10}$ and $P(B|A_2) = \frac{8}{10}$. Thus the
conditional probability of bowl $A_1$, given that a red chip is drawn, is

$$P(A_1|B) = \frac{P(A_1)P(B|A_1)}{P(A_1)P(B|A_1) + P(A_2)P(B|A_2)} = \frac{(\frac{2}{6})(\frac{3}{10})}{(\frac{2}{6})(\frac{3}{10}) + (\frac{4}{6})(\frac{8}{10})} = \frac{3}{19}.$$

In a similar manner, we have $P(A_2|B) = \frac{16}{19}$. ■


In Example 1.4.5, the probabilities $P(A_1) = \frac{2}{6}$ and $P(A_2) = \frac{4}{6}$ are called prior
probabilities of $A_1$ and $A_2$, respectively, because they are known to be due to the
random mechanism used to select the bowls. After the chip is taken and is observed
to be red, the conditional probabilities $P(A_1|B) = \frac{3}{19}$ and $P(A_2|B) = \frac{16}{19}$ are called
posterior probabilities. Since $A_2$ has a larger proportion of red chips than does
$A_1$, it appeals to one's intuition that $P(A_2|B)$ should be larger than $P(A_2)$ and,
of course, $P(A_1|B)$ should be smaller than $P(A_1)$. That is, intuitively the chances
of having bowl $A_2$ are better once a red chip is observed than before a chip
is taken. Bayes' theorem provides a method of determining exactly what those
probabilities are.
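
Bayes' theorem (1.4.3) turns a vector of prior probabilities and a vector of conditional probabilities into posterior probabilities. The following is a minimal sketch in R; `bayes_posterior` is a helper name chosen here for illustration, and it is applied to the two-bowl setup of Example 1.4.5 with priors 2/6 and 4/6 and red-chip probabilities 3/10 and 8/10.

```r
# Posterior probabilities P(A_j | B) from priors P(A_j) and likelihoods P(B | A_j).
# bayes_posterior is a hypothetical helper name, not a base R function.
bayes_posterior <- function(prior, likelihood) {
  stopifnot(length(prior) == length(likelihood),
            isTRUE(all.equal(sum(prior), 1)))   # the A_j must form a partition
  joint <- prior * likelihood    # P(A_j) P(B | A_j) for each j
  joint / sum(joint)             # divide by P(B), the law of total probability (1.4.2)
}

# Example 1.4.5: bowls A1 and A2, B = a red chip is drawn
bowls <- bayes_posterior(prior = c(2/6, 4/6), likelihood = c(3/10, 8/10))
```

The posterior for bowl $A_2$ exceeds its prior, matching the intuition described above. The same helper applies to any finite partition, for instance three plants with their own defect rates.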


Example 1.4.6. Three plants, $A_1$, $A_2$, and $A_3$, produce respectively, 10%, 50%,
and 40% of a company's output. Although plant $A_1$ is a small plant, its manager
believes in high quality and only 1% of its products are defective. The other two, $A_2$
and $A_3$, are worse and produce items that are 3% and 4% defective, respectively.
All products are sent to a central warehouse. One item is selected at random
and observed to be defective, say event $B$. The conditional probability that it
comes from plant $A_1$ is found as follows. It is natural to assign the respective prior
probabilities of getting an item from the plants as $P(A_1) = 0.1$, $P(A_2) = 0.5$, and
$P(A_3) = 0.4$, while the conditional probabilities of defective items are $P(B|A_1) = 0.01$,
$P(B|A_2) = 0.03$, and $P(B|A_3) = 0.04$. Thus the posterior probability of $A_1$,
given a defective, is

$$P(A_1|B) = \frac{P(A_1 \cap B)}{P(B)} = \frac{(0.10)(0.01)}{(0.1)(0.01) + (0.5)(0.03) + (0.4)(0.04)} = \frac{1}{32}.$$

This is much smaller than the prior probability $P(A_1) = \frac{1}{10}$. This is as it should be
because the fact that the item is defective decreases the chances that it comes from
the high-quality plant Ai. m 
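As a supplementary check, not part of the original example, the same denominator $P(B) = 0.032$ gives the posterior probabilities of the other two plants:

```latex
P(A_2|B) = \frac{(0.5)(0.03)}{0.032} = \frac{15}{32}, \qquad
P(A_3|B) = \frac{(0.4)(0.04)}{0.032} = \frac{16}{32}.
```

Together with $P(A_1|B) = \frac{1}{32}$, the three posterior probabilities sum to 1, as they must for a partition of the sample space.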
Example 1.4.7. Suppose we want to investigate the percentage of abused children
in a certain population. The events of interest are: a child is abused ($A$) and its
complement a child is not abused ($N = A^c$). For the purposes of this example, we
assume that $P(A) = 0.01$ and, hence, $P(N) = 0.99$. The classification as to whether
a child is abused or not is based upon a doctor's examination. Because doctors are
not perfect, they sometimes classify an abused child ($A$) as one that is not abused
($N_D$, where $N_D$ means classified as not abused by a doctor). On the other hand,
doctors sometimes classify a nonabused child ($N$) as abused ($A_D$). Suppose these
error rates of misclassification are $P(N_D|A) = 0.04$ and $P(A_D|N) = 0.05$; thus
the probabilities of correct decisions are $P(A_D|A) = 0.96$ and $P(N_D|N) = 0.95$.
Let us compute the probability that a child taken at random is classified as abused
by a doctor. Because this can happen in two ways, $A \cap A_D$ or $N \cap A_D$, we have

$$P(A_D) = P(A_D|A)P(A) + P(A_D|N)P(N) = (0.96)(0.01) + (0.05)(0.99) = 0.0591,$$

which is quite high relative to the probability of an abused child, 0.01. Further, the
probability that a child is abused when the doctor classified the child as abused is

$$P(A|A_D) = \frac{P(A \cap A_D)}{P(A_D)} = \frac{(0.96)(0.01)}{0.0591} = 0.1624,$$

which is quite low. In the same way, the probability that a child is not abused
when the doctor classified the child as abused is 0.8376, which is quite high. The
reason that these probabilities are so poor at recording the true situation is that the
doctors' error rates are so high relative to the fraction 0.01 of the population that
is abused. An investigation such as this would, hopefully, lead to better training of
doctors for classifying abused children. See also Exercise 1.4.17.
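The two probabilities of Example 1.4.7 can also be approximated by simulation. The sketch below is an illustration only: the variable names, the seed, and the number of simulated children are arbitrary choices made here, not taken from the text.

```r
# Simulation sketch of Example 1.4.7 (names and seed are illustrative).
set.seed(20)                          # arbitrary seed, for reproducibility only
n <- 1e6
abused <- runif(n) < 0.01             # TRUE with probability P(A) = 0.01
# Doctor's classification: P(A_D | A) = 0.96 and P(A_D | N) = 0.05
classified <- ifelse(abused, runif(n) < 0.96, runif(n) < 0.05)
est_pAD  <- mean(classified)          # estimates P(A_D), exact value 0.0591
est_post <- mean(abused[classified])  # estimates P(A | A_D), exact value 0.1624
c(est_pAD, est_post)
```

The second estimate conditions on the event $A_D$ by keeping only the simulated children who were classified as abused, which mirrors the definition of conditional probability.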


1.4.1 Independence 


Sometimes it happens that the occurrence of event $A$ does not change the probability
of event $B$; that is, when $P(A) > 0$,

$$P(B|A) = P(B).$$

In this case, we say that the events $A$ and $B$ are independent. Moreover, the
multiplication rule becomes
$$P(A \cap B) = P(A)P(B|A) = P(A)P(B). \tag{1.4.4}$$
This, in turn, implies, when $P(B) > 0$, that
$$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)P(B)}{P(B)} = P(A).$$


Note that if $P(A) > 0$ and $P(B) > 0$, then by the above discussion, independence
is equivalent to
$$P(A \cap B) = P(A)P(B). \tag{1.4.5}$$
What if either $P(A) = 0$ or $P(B) = 0$? In either case, the right side of (1.4.5) is 0.
However, the left side is 0 also because $A \cap B \subset A$ and $A \cap B \subset B$. Hence, we take
Equation (1.4.5) as our formal definition of independence; that is,

Definition 1.4.2. Let $A$ and $B$ be two events. We say that $A$ and $B$ are independent
if $P(A \cap B) = P(A)P(B)$. ■


Suppose $A$ and $B$ are independent events. Then the following three pairs of
events are independent: $A^c$ and $B$, $A$ and $B^c$, and $A^c$ and $B^c$. We show the first
and leave the other two to the exercises; see Exercise 1.4.11. Using the disjoint
union, $B = (A^c \cap B) \cup (A \cap B)$, we have

$$P(A^c \cap B) = P(B) - P(A \cap B) = P(B) - P(A)P(B) = [1 - P(A)]P(B) = P(A^c)P(B). \tag{1.4.6}$$
Hence, $A^c$ and $B$ are also independent.


Remark 1.4.1. Events that are independent are sometimes called statistically
independent, stochastically independent, or independent in a probability sense. In
most instances, we use independent without a modifier if there is no possibility of
misunderstanding. ■


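Example 1.4.8. A red die and a white die are cast in such a way that the numbers
of spots on the two sides that are up are independent events. If $A$ represents a
four on the red die and $B$ represents a three on the white die, with an equally
likely assumption for each side, we assign $P(A) = \frac{1}{6}$ and $P(B) = \frac{1}{6}$. Thus, from
independence, the probability of the ordered pair (red = 4, white = 3) is
$$P[(4,3)] = \left(\tfrac{1}{6}\right)\left(\tfrac{1}{6}\right) = \tfrac{1}{36}.$$
The probability that the sum of the up spots of the two dice equals seven is
$$P[(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)]
= \left(\tfrac{1}{6}\right)\left(\tfrac{1}{6}\right) + \left(\tfrac{1}{6}\right)\left(\tfrac{1}{6}\right) + \left(\tfrac{1}{6}\right)\left(\tfrac{1}{6}\right) + \left(\tfrac{1}{6}\right)\left(\tfrac{1}{6}\right) + \left(\tfrac{1}{6}\right)\left(\tfrac{1}{6}\right) + \left(\tfrac{1}{6}\right)\left(\tfrac{1}{6}\right) = \tfrac{6}{36}.$$
In a similar manner, it is easy to show that the probabilities of the sums of the
upfaces 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 are, respectively,
$$\tfrac{1}{36}, \tfrac{2}{36}, \tfrac{3}{36}, \tfrac{4}{36}, \tfrac{5}{36}, \tfrac{6}{36}, \tfrac{5}{36}, \tfrac{4}{36}, \tfrac{3}{36}, \tfrac{2}{36}, \tfrac{1}{36}.$$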

Suppose now that we have three events, $A_1$, $A_2$, and $A_3$. We say that they are
mutually independent if and only if they are pairwise independent:
$$P(A_1 \cap A_3) = P(A_1)P(A_3), \quad P(A_1 \cap A_2) = P(A_1)P(A_2), \quad P(A_2 \cap A_3) = P(A_2)P(A_3),$$
and
$$P(A_1 \cap A_2 \cap A_3) = P(A_1)P(A_2)P(A_3).$$

More generally, the $n$ events $A_1, A_2, \ldots, A_n$ are mutually independent if and only
if for every collection of $k$ of these events, $2 \le k \le n$, and for every permutation
$d_1, d_2, \ldots, d_k$ of $1, 2, \ldots, k$,

$$P(A_{d_1} \cap A_{d_2} \cap \cdots \cap A_{d_k}) = P(A_{d_1})P(A_{d_2}) \cdots P(A_{d_k}).$$
In particular, if $A_1, A_2, \ldots, A_n$ are mutually independent, then
$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)P(A_2) \cdots P(A_n).$$

Also, as with two sets, many combinations of these events and their complements
are independent, such as

1. The events $A_1^c$ and $A_2 \cup A_3^c \cup A_4$ are independent,
2. The events $A_1 \cup A_2^c$, $A_3$ and $A_4 \cap A_5^c$ are mutually independent.


If there is no possibility of misunderstanding, independent is often used without the 
modifier mutually when considering more than two events. 


Example 1.4.9. Pairwise independence does not imply mutual independence. As
an example, suppose we twice spin a fair spinner with the numbers 1, 2, 3, and 4.
Let $A_1$ be the event that the sum of the numbers spun is 5, let $A_2$ be the event that
the first number spun is a 1, and let $A_3$ be the event that the second number spun
is a 4. Then $P(A_i) = 1/4$, $i = 1, 2, 3$, and for $i \ne j$, $P(A_i \cap A_j) = 1/16$. So the
three events are pairwise independent. But $A_1 \cap A_2 \cap A_3$ is the event that (1,4) is
spun, which has probability $1/16 \ne 1/64 = P(A_1)P(A_2)P(A_3)$. Hence the events
$A_1$, $A_2$, and $A_3$ are not mutually independent. ■
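The claims of Example 1.4.9 can be verified by listing the 16 equally likely outcomes of the two spins. The R sketch below is an added illustration; the data frame spins and the helper p are names introduced here.

```r
# Enumerate the 16 equally likely outcomes of two spins (illustrative sketch).
spins <- expand.grid(first = 1:4, second = 1:4)
A1 <- spins$first + spins$second == 5   # sum of the two spins is 5
A2 <- spins$first == 1                  # first spin is a 1
A3 <- spins$second == 4                 # second spin is a 4
p <- function(event) mean(event)        # probability under equally likely outcomes

c(p(A1), p(A2), p(A3))                  # each equals 1/4
c(p(A1 & A2), p(A1 & A3), p(A2 & A3))   # each equals 1/16 = (1/4)(1/4)
p(A1 & A2 & A3)                         # 1/16, not (1/4)^3 = 1/64
```

Each pairwise intersection is the single outcome (1,4), and so is the triple intersection, which is why the product rule holds for pairs but fails for all three events.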


We often perform a sequence of random experiments in such a way that the 
events associated with one of them are independent of the events associated with 
the others. For convenience, we refer to these events as outcomes of independent
experiments, meaning that the respective events are independent. Thus we often 
refer to independent flips of a coin or independent casts of a die or, more generally, 
independent trials of some given random experiment. 


Example 1.4.10. A coin is flipped independently several times. Let the event $A_i$
represent a head (H) on the $i$th toss; thus $A_i^c$ represents a tail (T). Assume that $A_i$
and $A_i^c$ are equally likely; that is, $P(A_i) = P(A_i^c) = \frac{1}{2}$. Thus the probability of an
ordered sequence like HHTH is, from independence,

$$P(A_1 \cap A_2 \cap A_3^c \cap A_4) = P(A_1)P(A_2)P(A_3^c)P(A_4) = \left(\tfrac{1}{2}\right)^4 = \tfrac{1}{16}.$$
Similarly, the probability of observing the first head on the third flip is

$$P(A_1^c \cap A_2^c \cap A_3) = P(A_1^c)P(A_2^c)P(A_3) = \left(\tfrac{1}{2}\right)^3 = \tfrac{1}{8}.$$

Also, the probability of getting at least one head on four flips is

$$P(A_1 \cup A_2 \cup A_3 \cup A_4) = 1 - P[(A_1 \cup A_2 \cup A_3 \cup A_4)^c]
= 1 - P(A_1^c \cap A_2^c \cap A_3^c \cap A_4^c)
= 1 - \left(\tfrac{1}{2}\right)^4 = \tfrac{15}{16}.$$

See Exercise 1.4.13 to justify this last probability. ■
Example 1.4.11. A computer system is built so that if component $K_1$ fails, it is
bypassed and $K_2$ is used. If $K_2$ fails, then $K_3$ is used. Suppose that the probability
that $K_1$ fails is 0.01, that $K_2$ fails is 0.03, and that $K_3$ fails is 0.08. Moreover, we
can assume that the failures are mutually independent events. Then the probability
of failure of the system is

$$(0.01)(0.03)(0.08) = 0.000024,$$

as all three components would have to fail. Hence, the probability that the system
does not fail is $1 - 0.000024 = 0.999976$.


1.4.2 Simulations 


Many of the exercises at the end of this section are designed to aid the reader in
his/her understanding of the concepts of conditional probability and independence.
With diligence and patience, the reader will derive the exact answer. Many real-life
problems, though, are too complicated to allow for exact derivation. In such
cases, scientists often turn to computer simulations to estimate the answer. As an
example, suppose for an experiment, we want to obtain $P(A)$ for some event $A$.
A program is written that performs one trial (one simulation) of the experiment
and it records whether or not $A$ occurs. We then obtain $n$ independent simulations
(runs) of the program. Denote by $\hat{p}_n$ the proportion of these $n$ simulations in which
$A$ occurred. Then $\hat{p}_n$ is our estimate of $P(A)$. Besides the estimate of $P(A)$,
we also obtain an error of estimation given by $1.96\sqrt{\hat{p}_n(1 - \hat{p}_n)/n}$. As we discuss
theoretically in Chapter 4, we are 95% confident that $P(A)$ lies in the interval

$$\hat{p}_n \pm 1.96\sqrt{\frac{\hat{p}_n(1 - \hat{p}_n)}{n}}. \tag{1.4.7}$$


In Chapter 4, we call this interval a 95% confidence interval for P(A). For now, 
we make use of this confidence interval for our simulations. 
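The estimate and the interval (1.4.7) are simple to compute once the outcomes of the $n$ simulations are stored. The R sketch below is an added illustration: the helper sim_ci and its argument occurred are hypothetical names, and the event with probability 0.3 is an arbitrary test case, not one from the text.

```r
# Hypothetical helper: estimate and 95% interval (1.4.7) from n simulated trials.
# 'occurred' is a logical vector, TRUE when the event A occurred on that trial.
sim_ci <- function(occurred){
    n    <- length(occurred)
    phat <- mean(occurred)                       # proportion of trials with A
    err  <- 1.96 * sqrt(phat * (1 - phat) / n)   # error of estimation
    c(estimate = phat, lower = phat - err, upper = phat + err)
}

# Illustration with an event of known probability 0.3
set.seed(1)                                      # arbitrary seed
res <- sim_ci(runif(10000) < 0.3)
res
```

The error of estimation shrinks at the rate $1/\sqrt{n}$, so quadrupling the number of simulations roughly halves the width of the interval.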


Example 1.4.12. As an example, consider the game: 


Person A tosses a coin and then person B rolls a die. This is repeated 
independently until a head or one of the numbers 1,2,3,4 appears, at 
which time the game is stopped. Person A wins with the head and B 
wins with one of the numbers 1,2,3,4. Compute the probability P(A) 
that person A wins the game. 


For an exact derivation, notice that it is implicit in the statement A wins the game 
that the game is completed. Using abbreviated notation, the game is completed if 
H or T{1,...,4} occurs. Using independence, the probability that A wins is thus 
the conditional probability (1/2)/[(1/2) + (1/2)(4/6)] = 3/5. 
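The value 3/5 can also be reached by summing over the round on which the game ends. This is a supplementary check rather than the derivation used in the text; the index $k$ counts the initial rounds that produce neither a head nor one of the numbers 1, 2, 3, 4, each of which has probability $(1/2)(2/6) = 1/6$ by independence.

```latex
P(A) = \sum_{k=0}^{\infty} \left[\tfrac{1}{2}\cdot\tfrac{2}{6}\right]^{k} \tfrac{1}{2}
     = \frac{1/2}{1 - 1/6} = \frac{3}{5}.
```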

The following R function, abgame, simulates the problem. This function can be
downloaded and sourced at the site discussed in the Preface. The first line of the
program sets up the draws for persons A and B, respectively. The second line sets
up a flag for the while loop and initializes the returned values, Awin and Bwin,
at 0. The command sample(rngA,1,pr=pA) draws a sample of size 1 from rngA
with pmf pA. Each pass through the while loop plays one round, and the loop ends
when the game is complete, so each call of the function returns one complete game.
Further, the executions are independent of one another.


abgame <- function(){
    rngA <- c(0,1); pA <- rep(1/2,2); rngB <- 1:6; pB <- rep(1/6,6)
    ic <- 0; Awin <- 0; Bwin <- 0
    while(ic == 0){
        x <- sample(rngA,1,pr=pA)              # A tosses the coin (1 = head)
        if(x == 1){
            ic <- 1; Awin <- 1                 # head: A wins and the game stops
        } else {
            y <- sample(rngB,1,pr=pB)          # B rolls the die
            if(y <= 4){ic <- 1; Bwin <- 1}     # 1, 2, 3, or 4: B wins
        }
    }
    return(c(Awin,Bwin))
}


Notice that one and only one of Awin or Bwin receives the value 1, depending on
whether A or B wins. The next R segment simulates the game 10,000 times
and computes the estimate of the probability that A wins along with the error of estimation.


ind <- 0; nsims <- 10000
for(i in 1:nsims){
    seeA <- abgame()
    if(seeA[1] == 1){ind <- ind + 1}
}
estpA <- ind/nsims
err <- 1.96*sqrt(estpA*(1-estpA)/nsims)
estpA; err


An execution of this code resulted in estpA = 0.6001 and err = 0.0096. As noted
above, the probability that A wins is 0.6, which is in the interval 0.6001 ± 0.0096. As
discussed in Chapter 4, we expect this to occur 95% of the time when using such a
confidence interval. ■


EXERCISES 

1.4.1. If $P(A_1) > 0$ and if $A_2, A_3, A_4, \ldots$ are mutually disjoint sets, show that
$$P(A_2 \cup A_3 \cup \cdots | A_1) = P(A_2|A_1) + P(A_3|A_1) + \cdots.$$

1.4.2. Assume that $P(A_1 \cap A_2 \cap A_3) > 0$. Prove that

$$P(A_1 \cap A_2 \cap A_3 \cap A_4) = P(A_1)P(A_2|A_1)P(A_3|A_1 \cap A_2)P(A_4|A_1 \cap A_2 \cap A_3).$$

1.4.3. Suppose we are playing draw poker. We are dealt (from a well-shuffled deck)
five cards, which contain four spades and another card of a different suit. We decide 
to discard the card of a different suit and draw one card from the remaining cards 
to complete a flush in spades (all five cards spades). Determine the probability of 
completing the flush. 


1.4.4. From a well-shuffled deck of ordinary playing cards, four cards are turned 
over one at a time without replacement. What is the probability that the spades 
and red cards alternate? 


1.4.5. A hand of 13 cards is to be dealt at random and without replacement from 
an ordinary deck of playing cards. Find the conditional probability that there are 
at least three kings in the hand given that the hand contains at least two kings. 


1.4.6. A drawer contains eight different pairs of socks. If six socks are taken at 
random and without replacement, compute the probability that there is at least one 
matching pair among these six socks. Hint: Compute the probability that there is 
not a matching pair. 


1.4.7. A pair of dice is cast until either the sum of seven or eight appears. 
(a) Show that the probability of a seven before an eight is 6/11. 


(b) Next, this pair of dice is cast until a seven appears twice or until each of a 
six and eight has appeared at least once. Show that the probability of the six 
and eight occurring before two sevens is 0.546. 


1.4.8. In a certain factory, machines I, II, and III are all producing springs of the 
same length. Machines I, II, and III produce 1%, 4%, and 2% defective springs, 
respectively. Of the total production of springs in the factory, Machine I produces 
30%, Machine II produces 25%, and Machine III produces 45%. 


(a) If one spring is selected at random from the total springs produced in a given 
day, determine the probability that it is defective. 


(b) Given that the selected spring is defective, find the conditional probability 
that it was produced by Machine II. 


1.4.9. Bowl I contains six red chips and four blue chips. Five of these 10 chips 
are selected at random and without replacement and put in bowl II, which was 
originally empty. One chip is then drawn at random from bowl II. Given that this
chip is blue, find the conditional probability that two red chips and three blue chips 
are transferred from bowl I to bowl II. 


1.4.10. In an office there are two boxes of thumb drives: Box $A_1$ contains seven 100
GB drives and three 500 GB drives, and box $A_2$ contains two 100 GB drives and
eight 500 GB drives. A person is handed a box at random with prior probabilities
$P(A_1) = \frac{2}{3}$ and $P(A_2) = \frac{1}{3}$, possibly due to the boxes' respective locations. A drive
is then selected at random and the event $B$ occurs if it is a 500 GB drive. Using an
equally likely assumption for each drive in the selected box, compute $P(A_1|B)$ and
$P(A_2|B)$.
1.4.11. Suppose $A$ and $B$ are independent events. In expression (1.4.6) we showed
that $A^c$ and $B$ are independent events. Show similarly that the following pairs of
events are also independent: (a) $A$ and $B^c$ and (b) $A^c$ and $B^c$.


1.4.12. Let $C_1$ and $C_2$ be independent events with $P(C_1) = 0.6$ and $P(C_2) =$
Compute (a) $P(C_1 \cap C_2)$, (b) $P(C_1 \cup C_2)$, and (c) $P(C_1 \cup C_2^c)$.


1.4.13. Generalize Exercise 1.2.5 to obtain
$$(C_1 \cup C_2 \cup \cdots \cup C_k)^c = C_1^c \cap C_2^c \cap \cdots \cap C_k^c.$$

Say that $C_1, C_2, \ldots, C_k$ are independent events that have respective probabilities
$p_1, p_2, \ldots, p_k$. Argue that the probability of at least one of $C_1, C_2, \ldots, C_k$ is equal
to

$$1 - (1 - p_1)(1 - p_2) \cdots (1 - p_k).$$


1.4.14. Each of four persons fires one shot at a target. Let $C_k$ denote the event that
the target is hit by person $k$, $k = 1, 2, 3, 4$. If $C_1, C_2, C_3, C_4$ are independent and
if $P(C_1) = P(C_2) = 0.7$, $P(C_3) = 0.9$, and $P(C_4) = 0.4$, compute the probability
that (a) all of them hit the target; (b) exactly one hits the target; (c) no one hits
the target; (d) at least one hits the target.


1.4.15. A bowl contains three red (R) balls and seven white (W) balls of exactly 
the same size and shape. Select balls successively at random and with replacement 
so that the events of white on the first trial, white on the second, and so on, can be 
assumed to be independent. In four trials, make certain assumptions and compute 
the probabilities of the following ordered sequences: (a) WWRW; (b) RWWW; (c)
WWWR; and (d) WRWW. Compute the probability of exactly one red ball in the 
four trials. 


1.4.16. A coin is tossed two independent times, each resulting in a tail (T) or a head 
(H). The sample space consists of four ordered pairs: TT, TH, HT, HH. Making 
certain assumptions, compute the probability of each of these ordered pairs. What 
is the probability of at least one head? 


1.4.17. For Example 1.4.7, obtain the following probabilities. Explain what they 
mean in terms of the problem. 


(a) $P(N_D)$.
(b) $P(N | A_D)$.
(c) $P(A | N_D)$.
(d) $P(N | N_D)$.


1.4.18. A die is cast independently until the first 6 appears. If the casting stops 
on an odd number of times, Bob wins; otherwise, Joe wins. 


(a) Assuming the die is fair, what is the probability that Bob wins?
(b) Let $p$ denote the probability of a 6. Show that the game favors Bob, for all $p$,
$0 < p < 1$.


1.4.19. Cards are drawn at random and with replacement from an ordinary deck 
of 52 cards until a spade appears. 


(a) What is the probability that at least four draws are necessary? 
(b) Same as part (a), except the cards are drawn without replacement. 


1.4.20. A person answers each of two multiple choice questions at random. If there 
are four possible choices on each question, what is the conditional probability that 
both answers are correct given that at least one is correct? 


1.4.21. Suppose a fair 6-sided die is rolled six independent times. A match occurs
if side $i$ is observed on the $i$th trial, $i = 1, \ldots, 6$.


(a) What is the probability of at least one match on the six rolls? Hint: Let $C_i$
be the event of a match on the $i$th trial and use Exercise 1.4.13 to determine
the desired probability.


(b) Extend part (a) to a fair $n$-sided die with $n$ independent rolls. Then determine
the limit of the probability as $n \to \infty$.


1.4.22. Players A and B play a sequence of independent games. Player A throws 
a die first and wins on a “six.” If he fails, B throws and wins on a “five” or “six.” 
If he fails, A throws and wins on a “four,” “five,” or “six.” And so on. Find the 
probability of each player winning the sequence. 


1.4.23. Let $C_1$, $C_2$, $C_3$ be independent events with probabilities $\frac{1}{2}$, $\frac{1}{3}$, $\frac{1}{4}$, respectively.
Compute $P(C_1 \cup C_2 \cup C_3)$.


1.4.24. From a bowl containing five red, three white, and seven blue chips, select 
four at random and without replacement. Compute the conditional probability of 
one red, zero white, and three blue chips, given that there are at least three blue 
chips in this sample of four chips. 


1.4.25. Let the three mutually independent events $C_1$, $C_2$, and $C_3$ be such that


1.4.26. Each bag in a large box contains 25 tulip bulbs. It is known that 60% of 
the bags contain bulbs for 5 red and 20 yellow tulips, while the remaining 40% of 
the bags contain bulbs for 15 red and 10 yellow tulips. A bag is selected at random 
and a bulb taken at random from this bag is planted. 


(a) What is the probability that it will be a yellow tulip? 


(b) Given that it is yellow, what is the conditional probability it comes from a 
bag that contained 5 red and 20 yellow bulbs? 
1.4.27. The following game is played. The player randomly draws from the set of
integers $\{1, 2, \ldots, 20\}$. Let $x$ denote the number drawn. Next the player draws at
random from the set $\{x, \ldots, 25\}$. If on this second draw, he draws a number greater
than 21 he wins; otherwise, he loses.


(a) Determine the sum that gives the probability that the player wins. 


(b) Write and run a line of R code that computes the probability that the player 
wins. 


(c) Write an R function that simulates the game and returns whether or not the 
player wins. 


(d) Do 10,000 simulations of your program in Part (c). Obtain the estimate and
confidence interval, (1.4.7), for the probability that the player wins. Does
your interval trap the true probability?


1.4.28. A bowl contains 10 chips numbered 1,2,...,10, respectively. Five chips are 
drawn at random, one at a time, and without replacement. What is the probability 
that two even-numbered chips are drawn and they occur on even-numbered draws? 


1.4.29. A person bets 1 dollar to $b$ dollars that he can draw two cards from an
ordinary deck of cards without replacement and that they will be of the same suit.
Find $b$ so that the bet is fair.


1.4.30 (Monte Hall Problem). Suppose there are three curtains. Behind one curtain 
there is a nice prize, while behind the other two there are worthless prizes. A 
contestant selects one curtain at random, and then Monte Hall opens one of the 
other two curtains to reveal a worthless prize. Hall then expresses the willingness 
to trade the curtain that the contestant has chosen for the other curtain that has 
not been opened. Should the contestant switch curtains or stick with the one that 
she has? To answer the question, determine the probability that she wins the prize 
if she switches. 


1.4.31. A French nobleman, Chevalier de Méré, had asked a famous mathematician, 
Pascal, to explain why the following two probabilities were different (the difference 
had been noted from playing the game many times): (1) at least one six in four 
independent casts of a six-sided die; (2) at least a pair of sixes in 24 independent 
casts of a pair of dice. From proportions it seemed to de Méré that the probabilities 
should be the same. Compute the probabilities of (1) and (2). 


1.4.32. Hunters A and B shoot at a target; the probabilities of hitting the target
are $p_1$ and $p_2$, respectively. Assuming independence, can $p_1$ and $p_2$ be selected so
that

P(zero hits) = P(one hit) = P(two hits)?


1.4.33. At the beginning of a study of individuals, 15% were classified as heavy
smokers, 30% were classified as light smokers, and 55% were classified as nonsmokers.
In the five-year study, it was determined that the death rates of the heavy and
light smokers were five and three times that of the nonsmokers, respectively. A
randomly selected participant died over the five-year period: calculate the probability
that the participant was a nonsmoker.


1.4.34. A chemist wishes to detect an impurity in a certain compound that she is 
making. There is a test that detects an impurity with probability 0.90; however, 
this test indicates that an impurity is there when it is not about 5% of the time. 
The chemist produces compounds with the impurity about 20% of the time. A 
compound is selected at random from the chemist’s output. The test indicates that 
an impurity is present. What is the conditional probability that the compound 
actually has the impurity? 


1.5 Random Variables 


The reader perceives that a sample space $\mathcal{C}$ may be tedious to describe if the elements
of $\mathcal{C}$ are not numbers. We now discuss how we may formulate a rule, or a set of
rules, by which the elements $c$ of $\mathcal{C}$ may be represented by numbers. We begin the
discussion with a very simple example. Let the random experiment be the toss of
a coin and let the sample space associated with the experiment be $\mathcal{C} = \{H, T\}$,
where H and T represent heads and tails, respectively. Let $X$ be a function such
that $X(T) = 0$ and $X(H) = 1$. Thus $X$ is a real-valued function defined on the
sample space $\mathcal{C}$ which takes us from the sample space $\mathcal{C}$ to a space of real numbers
$\mathcal{D} = \{0, 1\}$. We now formulate the definition of a random variable and its space.


Definition 1.5.1. Consider a random experiment with a sample space $\mathcal{C}$. A function
$X$, which assigns to each element $c \in \mathcal{C}$ one and only one number $X(c) = x$, is
called a random variable. The space or range of $X$ is the set of real numbers
$\mathcal{D} = \{x : x = X(c), c \in \mathcal{C}\}$. ■


In this text, $\mathcal{D}$ generally is a countable set or an interval of real numbers. We call
random variables of the first type discrete random variables, while we call those of
the second type continuous random variables. In this section, we present examples
of discrete and continuous random variables and then in the next two sections we
discuss them separately.

Given a random variable $X$, its range $\mathcal{D}$ becomes the sample space of interest.
Besides inducing the sample space $\mathcal{D}$, $X$ also induces a probability which we call
the distribution of $X$.

Consider first the case where $X$ is a discrete random variable with a finite space
$\mathcal{D} = \{d_1, \ldots, d_m\}$. The only events of interest in the new sample space $\mathcal{D}$ are subsets
of $\mathcal{D}$. The induced probability distribution of $X$ is also clear. Define the function
$p_X(d_i)$ on $\mathcal{D}$ by

$$p_X(d_i) = P[\{c : X(c) = d_i\}], \quad \text{for } i = 1, \ldots, m. \tag{1.5.1}$$

In the next section, we formally define $p_X(d_i)$ as the probability mass function
(pmf) of $X$. Then the induced probability distribution, $P_X(\cdot)$, of $X$ is

$$P_X(D) = \sum_{d_i \in D} p_X(d_i), \quad D \subset \mathcal{D}.$$




As Exercise 1.5.11 shows, $P_X(D)$ is a probability on $\mathcal{D}$. An example is helpful here.


Example 1.5.1 (First Roll in the Game of Craps). Let $X$ be the sum of the
upfaces on a roll of a pair of fair 6-sided dice, each with the numbers 1 through 6
on it. The sample space is $\mathcal{C} = \{(i,j) : 1 \le i, j \le 6\}$. Because the dice are fair,
$P[\{(i,j)\}] = 1/36$. The random variable $X$ is $X(i,j) = i + j$. The space of $X$ is
$\mathcal{D} = \{2, \ldots, 12\}$. By enumeration, the pmf of $X$ is given by

    Range value x         2     3     4     5     6     7     8     9     10    11    12
    Probability p_X(x)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

To illustrate the computation of probabilities concerning $X$, suppose $B_1 = \{x : x =
7, 11\}$ and $B_2 = \{x : x = 2, 3, 12\}$. Then, using the values of $p_X(x)$ given in the
table,

$$P_X(B_1) = \sum_{x \in B_1} p_X(x) = \frac{6}{36} + \frac{2}{36} = \frac{8}{36}$$
$$P_X(B_2) = \sum_{x \in B_2} p_X(x) = \frac{1}{36} + \frac{2}{36} + \frac{1}{36} = \frac{4}{36}.$$
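The enumeration behind the table and the two probabilities can be reproduced in a few lines of R. This is an added sketch; the object names rolls, pmf, and PX are introduced here for illustration.

```r
# pmf of the sum of two fair dice by enumeration (names are illustrative)
rolls <- expand.grid(i = 1:6, j = 1:6)    # the 36 outcomes (i, j)
x     <- rolls$i + rolls$j                # X(i, j) = i + j
pmf   <- table(x) / 36                    # p_X(x) for x = 2, ..., 12

# Induced probability P_X(B): add p_X(x) over the points x in B
PX <- function(B) sum(pmf[as.character(B)])

PX(c(7, 11))       # 8/36
PX(c(2, 3, 12))    # 4/36
```

The function PX is the induced probability of the preceding discussion restricted to finite subsets of the space of $X$: it never refers back to the original outcomes $(i, j)$, only to the pmf.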


The second case is when X is a continuous random variable. In this case, D 
is an interval of real numbers. In practice, continuous random variables are often 
measurements. For example, the weight of an adult is modeled by a continuous 
random variable. Here we would not be interested in the probability that a person 
weighs exactly 200 pounds, but we may be interested in the probability that a 
person weighs over 200 pounds. Generally, for the continuous random variables, 
the simple events of interest are intervals. We can usually determine a nonnegative
function $f_X(x)$ such that for any interval of real numbers $(a,b) \subset \mathcal{D}$, the induced
probability distribution of $X$, $P_X(\cdot)$, is defined as

$$P_X[(a,b)] = P[\{c \in \mathcal{C} : a < X(c) < b\}] = \int_a^b f_X(x)\,dx; \tag{1.5.2}$$

that is, the probability that $X$ falls between $a$ and $b$ is the area under the curve
$y = f_X(x)$ between $a$ and $b$. Besides $f_X(x) \ge 0$, we also require that $P_X(\mathcal{D}) =
\int_{\mathcal{D}} f_X(x)\,dx = 1$ (total area under the curve over the sample space of $X$ is 1). There
are some technical issues in defining events in general for the space $\mathcal{D}$; however, it
can be shown that $P_X(D)$ is a probability on $\mathcal{D}$; see Exercise 1.5.11. The function
$f_X$ is formally defined as the probability density function (pdf) of $X$ in Section
1.7. An example is in order.


Example 1.5.2. For an example of a continuous random variable, consider the
following simple experiment: choose a real number at random from the interval
(0,1). Let $X$ be the number chosen. In this case the space of $X$ is $\mathcal{D} = (0,1)$. It is
not obvious as it was in the last example what the induced probability $P_X$ is. But
there are some intuitive probabilities. For instance, because the number is chosen
at random, it is reasonable to assign

$$P_X[(a,b)] = b - a, \quad \text{for } 0 < a < b < 1. \tag{1.5.3}$$
It follows that the pdf of $X$ is

$$f_X(x) = \begin{cases} 1 & 0 < x < 1 \\ 0 & \text{elsewhere.} \end{cases} \tag{1.5.4}$$

For example, the probability that $X$ is less than an eighth or greater than seven
eighths is

$$P\left[\{X < \tfrac{1}{8}\} \cup \{X > \tfrac{7}{8}\}\right] = \int_0^{1/8} dx + \int_{7/8}^{1} dx = \frac{1}{4}.$$

Notice that a discrete probability model is not a possibility for this experiment. For
any point $a$, $0 < a < 1$, we can choose $n_0$ so large such that $0 < a - n_0^{-1} < a <
a + n_0^{-1} < 1$, i.e., $\{a\} \subset (a - n_0^{-1}, a + n_0^{-1})$. Hence,

$$P(X = a) \le P\left(a - \frac{1}{n} < X < a + \frac{1}{n}\right) = \frac{2}{n}, \quad \text{for all } n \ge n_0. \tag{1.5.5}$$

Since $2/n \to 0$ as $n \to \infty$ and $a$ is arbitrary, we conclude that $P(X = a) = 0$ for
all $a \in (0,1)$. Hence, the reasonable pdf, (1.5.4), for this model excludes a discrete
probability model.
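A short simulation illustrates both points of Example 1.5.2. The sketch below is an addition; it assumes that R's runif() is an adequate stand-in for choosing a number at random from (0,1), and the seed and sample size are arbitrary.

```r
# Simulation sketch for Example 1.5.2; runif() draws from the interval (0, 1).
set.seed(2)                           # arbitrary seed
x <- runif(100000)
est <- mean(x < 1/8 | x > 7/8)        # estimates P(X < 1/8 or X > 7/8) = 1/4
est
mean(x == 0.5)                        # proportion of draws equal to the point 0.5
```

The last line counts how often a draw lands exactly on a single fixed point; in line with $P(X = a) = 0$, such hits are not expected in practice, although a computer can represent only finitely many values in (0,1).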


Remark 1.5.1. In equations (1.5.1) and (1.5.2), the subscript $X$ on $p_X$ and $f_X$
identifies the pmf and pdf, respectively, with the random variable. We often use
this notation, especially when there are several random variables in the discussion.
On the other hand, if the identity of the random variable is clear, then we often
suppress the subscripts.


The pmf of a discrete random variable and the pdf of a continuous random 
variable are quite different entities. The distribution function, though, uniquely 
determines the probability distribution of a random variable. It is defined by: 


Definition 1.5.2 (Cumulative Distribution Function). Let $X$ be a random variable.
Then its cumulative distribution function (cdf) is defined by $F_X(x)$, where

$$F_X(x) = P_X((-\infty, x]) = P(\{c \in \mathcal{C} : X(c) \le x\}). \tag{1.5.6}$$

As above, we shorten $P(\{c \in \mathcal{C} : X(c) \le x\})$ to $P(X \le x)$. Also, $F_X(x)$ is
often called simply the distribution function (df). However, in this text, we use the
modifier cumulative as $F_X(x)$ accumulates the probabilities less than or equal to $x$.

The next example discusses a cdf for a discrete random variable. 


Example 1.5.3. Suppose we roll a fair die with the numbers 1 through 6 on it.
Let $X$ be the upface of the roll. Then the space of $X$ is $\{1, 2, \ldots, 6\}$ and its pmf
is $p_X(i) = 1/6$, for $i = 1, 2, \ldots, 6$. If $x < 1$, then $F_X(x) = 0$. If $1 \le x < 2$, then
$F_X(x) = 1/6$. Continuing this way, we see that the cdf of $X$ is an increasing step
function which steps up by $p_X(i)$ at each $i$ in the space of $X$. The graph of $F_X$ is
given by Figure 1.5.1. Note that if we are given the cdf, then we can determine the
pmf of $X$. ■
[Plot of F(x) against x: a step function with x-axis ticks at 1 through 6 and vertical-axis ticks at 0.5 and 1.0.]


Figure 1.5.1: Distribution function for Example 1.5.3. 
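The step function of Example 1.5.3 can be built directly from the pmf. The following R sketch is an added illustration using the base function stepfun(), whose default makes the result right continuous; the names pmf and Fx are chosen here.

```r
# cdf of a fair die as a right-continuous step function (illustrative sketch)
pmf <- rep(1/6, 6)
Fx  <- stepfun(1:6, c(0, cumsum(pmf)))   # jumps of size p_X(i) at i = 1, ..., 6

Fx(c(0.5, 1, 3.7, 6))                    # 0, 1/6, 3/6, 1
# Recover the pmf from the cdf: the size of the jump at each point of the space
diff(Fx(0:6))                            # six jumps, each 1/6
```

The last line carries out the remark at the end of the example: differencing the cdf across the points of the space returns the pmf.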


The following example discusses the cdf for the continuous random variable 
discussed in Example 1.5.2. 


Example 1.5.4 (Continuation of Example 1.5.2). Recall that $X$ denotes a real
number chosen at random between 0 and 1. We now obtain the cdf of $X$. First, if
$x < 0$, then $P(X \le x) = 0$. Next, if $x \ge 1$, then $P(X \le x) = 1$. Finally, if $0 < x <
1$, it follows from expression (1.5.3) that $P(X \le x) = P(0 < X \le x) = x - 0 = x$.
Hence the cdf of $X$ is

$$F_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } 0 \le x < 1 \\ 1 & \text{if } x \ge 1. \end{cases} \tag{1.5.7}$$

A sketch of the cdf of $X$ is given in Figure 1.5.2. Note, however, the connection
between $F_X(x)$ and the pdf for this experiment $f_X(x)$, given in Example 1.5.2, is

$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt, \quad \text{for all } x \in R,$$

and $\frac{d}{dx}F_X(x) = f_X(x)$, for all $x \in R$, except for $x = 0$ and $x = 1$. ■


Let X and Y be two random variables. We say that X and Y are equal in
distribution and write $X \stackrel{D}{=} Y$ if and only if $F_X(x) = F_Y(x)$, for all $x \in R$. It
is important to note that while X and Y may be equal in distribution, they may be
quite different. For instance, in the last example define the random variable Y as
$Y = 1 - X$. Then $Y \ne X$. But the space of Y is the interval (0,1), the same as X.
Further, the cdf of Y is 0 for $y < 0$; 1 for $y \ge 1$; and for $0 \le y < 1$, it is

\[
F_Y(y) = P(Y \le y) = P(1 - X \le y) = P(X \ge 1 - y) = 1 - (1 - y) = y.
\]

Hence, Y has the same cdf as X, i.e., $Y \stackrel{D}{=} X$, but $Y \ne X$.
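
A small simulation can make the distinction between equality in distribution and equality of random variables concrete. The sketch below is only illustrative: the seed, the sample size, and the names x, y, t_grid, and prop_equal are arbitrary choices, not part of the text.

```r
set.seed(1)                 # arbitrary seed, for reproducibility only
x <- runif(10000)           # draws of X, uniform on (0, 1)
y <- 1 - x                  # Y = 1 - X, computed from the same draws

# Empirical cdfs of X and Y at a few points; both should be near F(t) = t
t_grid <- c(0.1, 0.25, 0.5, 0.75, 0.9)
Fx_hat <- ecdf(x)(t_grid)
Fy_hat <- ecdf(y)(t_grid)
round(rbind(t_grid, Fx_hat, Fy_hat), 3)

# Yet X and Y are different random variables: X = Y only when X = 1/2
prop_equal <- mean(x == y)
prop_equal
```

The two empirical cdfs should agree up to simulation error, while essentially none of the paired draws coincide.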
Figure 1.5.2: Distribution function for Example 1.5.4.


The cdfs displayed in Figures 1.5.1 and 1.5.2 show increasing functions with lower 
limits 0 and upper limits 1. In both figures, the cdfs are at least right continuous. 
As the next theorem proves, these properties are true in general for cdfs. 


Theorem 1.5.1. Let X be a random variable with cumulative distribution function
F(x). Then

(a) For all a and b, if $a < b$, then $F(a) \le F(b)$ (F is nondecreasing).

(b) $\lim_{x \to -\infty} F(x) = 0$ (the lower limit of F is 0).

(c) $\lim_{x \to \infty} F(x) = 1$ (the upper limit of F is 1).

(d) $\lim_{x \downarrow x_0} F(x) = F(x_0)$ (F is right continuous).

Proof: We prove parts (a) and (d) and leave parts (b) and (c) for Exercise 1.5.10.
Part (a): Because $a < b$, we have $\{X \le a\} \subset \{X \le b\}$. The result then follows
from the monotonicity of P; see Theorem 1.3.3.

Part (d): Let $\{x_n\}$ be any sequence of real numbers such that $x_n \downarrow x_0$. Let $C_n = \{X \le x_n\}$.
Then the sequence of sets $\{C_n\}$ is decreasing and $\bigcap_{n=1}^{\infty} C_n = \{X \le x_0\}$.
Hence, by Theorem 1.3.6,

\[
\lim_{n \to \infty} F(x_n) = P\left( \bigcap_{n=1}^{\infty} C_n \right) = F(x_0),
\]

which is the desired result. ■


The next theorem is helpful in evaluating probabilities using cdfs. 


Theorem 1.5.2. Let X be a random variable with the cdf $F_X$. Then for $a < b$,
$P[a < X \le b] = F_X(b) - F_X(a)$.

Proof: Note that

\[
\{-\infty < X \le b\} = \{-\infty < X \le a\} \cup \{a < X \le b\}.
\]

The proof of the result follows immediately because the union on the right side of
this equation is a disjoint union. ■


Example 1.5.5. Let X be the lifetime in years of a mechanical part. Assume that
X has the cdf

\[
F_X(x) = \begin{cases} 0 & x < 0 \\ 1 - e^{-x} & 0 \le x. \end{cases}
\]

The pdf of X, $\frac{d}{dx} F_X(x)$, is

\[
f_X(x) = \begin{cases} e^{-x} & 0 < x < \infty \\ 0 & \text{elsewhere.} \end{cases}
\]

Actually the derivative does not exist at $x = 0$, but in the continuous case the next
theorem (1.5.3) shows that P(X = 0) = 0 and we can assign $f_X(0) = 0$ without
changing the probabilities concerning X. The probability that a part has a lifetime
between one and three years is given by

\[
P(1 < X \le 3) = F_X(3) - F_X(1) = \int_1^3 e^{-x}\,dx.
\]

That is, the probability can be found by $F_X(3) - F_X(1)$ or by evaluating the integral.
In either case, it equals $e^{-1} - e^{-3} = 0.318$. ■
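
Both routes to the probability in Example 1.5.5 are easy to carry out in R. This is a minimal sketch; the function names Fx and fx are illustrative, and the use of pexp rests on the observation that the cdf above is that of R's exponential distribution with rate 1.

```r
# cdf of Example 1.5.5: F(x) = 1 - exp(-x) for x >= 0, and 0 otherwise
Fx <- function(x) ifelse(x < 0, 0, 1 - exp(-x))
# pdf: f(x) = exp(-x) for x > 0, and 0 elsewhere
fx <- function(x) ifelse(x > 0, exp(-x), 0)

p_cdf <- Fx(3) - Fx(1)                               # F(3) - F(1)
p_int <- integrate(fx, lower = 1, upper = 3)$value   # numerical integral of the pdf
p_builtin <- pexp(3) - pexp(1)                       # built-in exponential cdf, rate = 1
c(p_cdf, p_int, p_builtin)
```

All three values should agree with $e^{-1} - e^{-3}$ up to numerical error.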


Theorem 1.5.1 shows that cdfs are right continuous and monotone. Such func- 
tions can be shown to have only a countable number of discontinuities. As the next 
theorem shows, the discontinuities of a cdf have mass; that is, if x is a point of 
discontinuity of Fx, then we have P(X = x) > 0. 


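Theorem 1.5.3. For any random variable,

\[
P[X = x] = F_X(x) - F_X(x-), \tag{1.5.8}
\]

for all $x \in R$, where $F_X(x-) = \lim_{z \uparrow x} F_X(z)$.

Proof: For any $x \in R$, we have

\[
\{x\} = \bigcap_{n=1}^{\infty} \left( x - \frac{1}{n}, x \right];
\]

that is, {x} is the limit of a decreasing sequence of sets. Hence, by Theorem 1.3.6,

\[
\begin{aligned}
P[X = x] &= P\left[ \bigcap_{n=1}^{\infty} \left\{ x - \frac{1}{n} < X \le x \right\} \right] \\
&= \lim_{n \to \infty} P\left[ x - \frac{1}{n} < X \le x \right] \\
&= \lim_{n \to \infty} [F_X(x) - F_X(x - (1/n))] \\
&= F_X(x) - F_X(x-),
\end{aligned}
\]

which is the desired result. ■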


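Example 1.5.6. Let X have the discontinuous cdf

\[
F_X(x) = \begin{cases} 0 & x < 0 \\ x/2 & 0 \le x < 1 \\ 1 & 1 \le x. \end{cases}
\]

Then

\[
P(-1 < X \le 1/2) = F_X(1/2) - F_X(-1) = \frac{1}{4} - 0 = \frac{1}{4}
\]

and

\[
P(X = 1) = F_X(1) - F_X(1-) = 1 - \frac{1}{2} = \frac{1}{2}.
\]

The value 1/2 equals the value of the step of $F_X$ at $x = 1$. ■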


Since the total probability associated with a random variable X of the discrete
type with pmf $p_X(x)$ or of the continuous type with pdf $f_X(x)$ is 1, then it must be
true that

\[
\sum_{x \in \mathcal{D}} p_X(x) = 1 \quad \text{and} \quad \int_{\mathcal{D}} f_X(x)\,dx = 1,
\]

where $\mathcal{D}$ is the space of X. As the next two examples show, we can use this
property to determine the pmf or pdf if we know the pmf or pdf down to a constant
of proportionality.


Example 1.5.7. Suppose X has the pmf

\[
p_X(x) = \begin{cases} cx & x = 1, 2, \ldots, 10 \\ 0 & \text{elsewhere,} \end{cases}
\]

for an appropriate constant c. Then

\[
1 = \sum_{x=1}^{10} p_X(x) = \sum_{x=1}^{10} cx = c(1 + 2 + \cdots + 10) = 55c,
\]

and, hence, c = 1/55. ■
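Example 1.5.8. Suppose X has the pdf

\[
f_X(x) = \begin{cases} cx^3 & 0 < x < 2 \\ 0 & \text{elsewhere,} \end{cases}
\]

for a constant c. Then

\[
1 = \int_0^2 cx^3\,dx = c\left[ \frac{x^4}{4} \right]_0^2 = 4c,
\]

and, hence, c = 1/4. For illustration of the computation of a probability involving
X, we have

\[
P\left( \frac{1}{4} < X < 1 \right) = \int_{1/4}^{1} \frac{x^3}{4}\,dx = \frac{255}{4096} = 0.06226. \quad ■
\]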
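
The constants of proportionality in Examples 1.5.7 and 1.5.8 can be confirmed with a few lines of R. In this sketch the names c_pmf, c_pdf, kernel, and p are illustrative only.

```r
# Example 1.5.7: p(x) = c * x on x = 1, ..., 10, so c = 1 / sum(x)
c_pmf <- 1 / sum(1:10)

# Example 1.5.8: f(x) = c * x^3 on (0, 2), so c = 1 / (integral of x^3 over (0, 2))
kernel <- function(x) x^3
c_pdf <- 1 / integrate(kernel, lower = 0, upper = 2)$value

# P(1/4 < X < 1) for the normalized pdf of Example 1.5.8
p <- integrate(function(x) c_pdf * kernel(x), lower = 1/4, upper = 1)$value
c(c_pmf, c_pdf, p)
```

The results should match c = 1/55, c = 1/4, and 255/4096, respectively.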
EXERCISES


1.5.1. Let a card be selected from an ordinary deck of playing cards. The outcome
c is one of these 52 cards. Let X(c) = 4 if c is an ace, let X(c) = 3 if c is a king,
let X(c) = 2 if c is a queen, let X(c) = 1 if c is a jack, and let X(c) = 0 otherwise.
Suppose that P assigns a probability of 1/52 to each outcome c. Describe the induced
probability $P_X(D)$ on the space $\mathcal{D} = \{0,1,2,3,4\}$ of the random variable X.


1.5.2. For each of the following, find the constant c so that p(x) satisfies the con- 
dition of being a pmf of one random variable X. 


(a) $p(x) = c(2/3)^x$, x = 1,2,3,..., zero elsewhere.
(b) $p(x) = cx$, x = 1,2,3,4,5,6, zero elsewhere.


1.5.3. Let $p_X(x) = x/15$, x = 1,2,3,4,5, zero elsewhere, be the pmf of X. Find
P(X = 1 or 2), $P(1/2 < X < 5/2)$, and $P(1 \le X \le 2)$.


1.5.4. Let $p_X(x)$ be the pmf of a random variable X. Find the cdf F(x) of X and
sketch its graph along with that of $p_X(x)$ if:


(a) $p_X(x) = 1$, x = 0, zero elsewhere.
(b) $p_X(x) = 1/3$, x = −1,0,1, zero elsewhere.
(c) $p_X(x) = x/15$, x = 1,2,3,4,5, zero elsewhere.


1.5.5. Let us select five cards at random and without replacement from an ordinary 
deck of playing cards. 


(a) Find the pmf of X, the number of hearts in the five cards. 
(b) Determine P(X < 1). 


1.5.6. Let the probability set function of the random variable X be $P_X(D) = \int_D f(x)\,dx$,
where $f(x) = 2x/9$, for $x \in \mathcal{D} = \{x : 0 < x < 3\}$. Define the events
$D_1 = \{x : 0 < x < 1\}$ and $D_2 = \{x : 2 < x < 3\}$. Compute $P_X(D_1)$, $P_X(D_2)$, and
$P_X(D_1 \cup D_2)$.


1.5.7. Let the space of the random variable X be $\mathcal{D} = \{x : 0 < x < 1\}$. If
$D_1 = \{x : 0 < x < 1/2\}$ and $D_2 = \{x : 1/2 \le x < 1\}$, find $P_X(D_2)$ if $P_X(D_1) = 1/4$.


1.5.8. Suppose the random variable X has the cdf

\[
F(x) = \begin{cases} 0 & x < -1 \\ \frac{x+2}{4} & -1 \le x < 1 \\ 1 & 1 \le x. \end{cases}
\]

Write an R function to sketch the graph of F(x). Use your graph to obtain the
probabilities: (a) $P(-1/2 < X \le 1/2)$; (b) P(X = 0); (c) P(X = 1); (d) $P(2 < X \le 3)$.
1.5.9. Consider an urn that contains slips of paper each with one of the numbers
1,2,...,100 on it. Suppose there are i slips with the number i on it for
i = 1,2,...,100. For example, there are 25 slips of paper with the number 25. Assume
that the slips are identical except for the numbers. Suppose one slip is drawn
at random. Let X be the number on the slip.


(a) Show that X has the pmf p(x) = x/5050, x = 1,2,3,...,100, zero elsewhere.
(b) Compute P(X < 50).
(c) Show that the cdf of X is F(x) = [x]([x] + 1)/10100, for $1 \le x \le 100$, where
[x] is the greatest integer in x.
1.5.10. Prove parts (b) and (c) of Theorem 1.5.1. 


1.5.11. Let X be a random variable with space $\mathcal{D}$. For $D \subset \mathcal{D}$, recall that the
probability induced by X is $P_X(D) = P[\{c : X(c) \in D\}]$. Show that $P_X(D)$ is a
probability by showing the following:
(a) $P_X(\mathcal{D}) = 1$.
(b) $P_X(D) \ge 0$.
(c) For a sequence of sets $\{D_n\}$ in $\mathcal{D}$, show that

\[
\{c : X(c) \in \cup_n D_n\} = \cup_n \{c : X(c) \in D_n\}.
\]

(d) Use part (c) to show that if $\{D_n\}$ is a sequence of mutually exclusive events,
then

\[
P_X\left( \bigcup_{n=1}^{\infty} D_n \right) = \sum_{n=1}^{\infty} P_X(D_n).
\]


Remark 1.5.2. In a probability theory course, we would show that the σ-field
(collection of events) for $\mathcal{D}$ is the smallest σ-field which contains all the open intervals
of real numbers; see Exercise 1.3.24. Such a collection of events is sufficiently rich
for discrete and continuous random variables. ■


1.6 Discrete Random Variables 


The first example of a random variable encountered in the last section was an 
example of a discrete random variable, which is defined next. 


Definition 1.6.1 (Discrete Random Variable). We say a random variable is a 
discrete random variable if its space is either finite or countable. 


Example 1.6.1. Consider a sequence of independent flips of a coin, each resulting
in a head (H) or a tail (T). Moreover, on each flip, we assume that H and T are
equally likely; that is, P(H) = P(T) = 1/2. The sample space $\mathcal{C}$ consists of sequences
like TTHTHHT···. Let the random variable X equal the number of flips needed
to obtain the first head. Hence, X(TTHTHHT···) = 3. Clearly, the space of X is
$\mathcal{D} = \{1,2,3,4,\ldots\}$. We see that X = 1 when the sequence begins with an H and
thus P(X = 1) = 1/2. Likewise, X = 2 when the sequence begins with TH, which
has probability P(X = 2) = (1/2)(1/2) = 1/4 from the independence. More generally,
if X = x, where x = 1,2,3,4,..., there must be a string of x − 1 tails followed
by a head; that is, TT···TH, where there are x − 1 tails in TT···T. Thus, from
independence, we have a geometric sequence of probabilities, namely,

\[
P(X = x) = \left( \frac{1}{2} \right)^{x-1} \left( \frac{1}{2} \right) = \left( \frac{1}{2} \right)^{x}, \quad x = 1, 2, 3, \ldots, \tag{1.6.1}
\]

the space of which is countable. An interesting event is that the first head appears
on an odd number of flips; i.e., $X \in \{1,3,5,\ldots\}$. The probability of this event is

\[
P[X \in \{1,3,5,\ldots\}] = \sum_{x=1}^{\infty} \left( \frac{1}{2} \right)^{2x-1} = \frac{1}{2} \sum_{x=1}^{\infty} \left( \frac{1}{4} \right)^{x-1} = \frac{1/2}{1 - (1/4)} = \frac{2}{3}. \quad ■
\]
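
The value 2/3 can be checked both by summing the pmf (1.6.1) and by simulation. The following R sketch is illustrative: the truncation point 199, the seed, the sample size, and the names p_geo, p_odd, and p_odd_sim are assumptions made here, not part of the text.

```r
# pmf (1.6.1): p(x) = (1/2)^x for x = 1, 2, 3, ...
p_geo <- function(x) (1/2)^x

# Partial sum over odd x; the tail beyond x = 199 is negligible
odd <- seq(1, 199, by = 2)
p_odd <- sum(p_geo(odd))

# Simulation check: rgeom() counts failures before the first success,
# so the flip number of the first head is rgeom(...) + 1
set.seed(2)   # arbitrary seed
x <- rgeom(100000, prob = 0.5) + 1
p_odd_sim <- mean(x %% 2 == 1)
c(p_odd, p_odd_sim)
```

The partial sum should agree with 2/3 to machine precision, and the simulated proportion should be close to it.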


As the last example suggests, probabilities concerning a discrete random vari- 
able can be obtained in terms of the probabilities P(X = x), for x € D. These 
probabilities determine an important function, which we define as 


Definition 1.6.2 (Probability Mass Function (pmf)). Let X be a discrete random 
variable with space D. The probability mass function (pmf) of X is given by 


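\[
p_X(x) = P[X = x], \quad \text{for } x \in \mathcal{D}. \tag{1.6.2}
\]

Note that pmfs satisfy the following two properties:

\[
\text{(i) } 0 \le p_X(x) \le 1,\ x \in \mathcal{D}, \quad \text{and} \quad \text{(ii) } \sum_{x \in \mathcal{D}} p_X(x) = 1. \tag{1.6.3}
\]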


In a more advanced class it can be shown that if a function satisfies properties (i) 
and (ii) for a discrete set D, then this function uniquely determines the distribution 
of a random variable. 

Let X be a discrete random variable with space $\mathcal{D}$. As Theorem 1.5.3 shows,
discontinuities of $F_X(x)$ define a mass; that is, if x is a point of discontinuity of $F_X$,
then P(X = x) > 0. We now make a distinction between the space of a discrete
random variable and these points of positive probability. We define the support of
a discrete random variable X to be the points in the space of X which have positive
probability. We often use S to denote the support of X. Note that $S \subset \mathcal{D}$, but it
may be that $S = \mathcal{D}$.

Also, we can use Theorem 1.5.3 to obtain a relationship between the pmf and
cdf of a discrete random variable. If $x \in S$, then $p_X(x)$ is equal to the size of the
discontinuity of $F_X$ at x. If $x \notin S$ then P[X = x] = 0 and, hence, $F_X$ is continuous
at this x.


Example 1.6.2. A lot, consisting of 100 fuses, is inspected by the following procedure.
Five of these fuses are chosen at random and tested; if all five "blow" at the
correct amperage, the lot is accepted. If, in fact, there are 20 defective fuses in the
lot, the probability of accepting the lot is, under appropriate assumptions,

\[
\frac{\binom{80}{5}}{\binom{100}{5}} = 0.31931.
\]

More generally, let the random variable X be the number of defective fuses among
the five that are inspected. The pmf of X is given by

\[
p_X(x) = \begin{cases} \dfrac{\binom{20}{x}\binom{80}{5-x}}{\binom{100}{5}} & \text{for } x = 0,1,2,3,4,5 \\ 0 & \text{elsewhere.} \end{cases} \tag{1.6.4}
\]

Clearly, the space of X is $\mathcal{D} = \{0,1,2,3,4,5\}$, which is also its support. This is an
example of a random variable of the discrete type whose distribution is an illustration
of a hypergeometric distribution, which is formally defined in Chapter 3.
Based on the above discussion, it is easy to graph the cdf of X; see Exercise 1.6.5.
■
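
The numbers in Example 1.6.2 can be reproduced with choose, and cross-checked against R's built-in hypergeometric pmf dhyper. This is a minimal sketch with illustrative variable names (x, pmf, builtin).

```r
# pmf (1.6.4): x defectives among 5 fuses drawn from 20 defective and 80 good
x <- 0:5
pmf <- choose(20, x) * choose(80, 5 - x) / choose(100, 5)

pmf[1]      # P(X = 0): no defectives in the sample, so the lot is accepted
sum(pmf)    # a pmf must sum to 1 over its space

# Cross-check against R's hypergeometric pmf
# (m = defectives in the lot, n = good fuses in the lot, k = sample size)
builtin <- dhyper(x, m = 20, n = 80, k = 5)
all.equal(pmf, builtin)
```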


1.6.1 Transformations 


A problem often encountered in statistics is the following. We have a random 
variable X and we know its distribution. We are interested, though, in a random 
variable Y which is some transformation of X, say, Y = g(X). In particular, 
we want to determine the distribution of Y. Assume X is discrete with space $\mathcal{D}_X$.
Then the space of Y is $\mathcal{D}_Y = \{g(x) : x \in \mathcal{D}_X\}$. We consider two cases.

In the first case, g is one-to-one. Then, clearly, the pmf of Y is obtained as

\[
p_Y(y) = P[Y = y] = P[g(X) = y] = P[X = g^{-1}(y)] = p_X(g^{-1}(y)). \tag{1.6.5}
\]


Example 1.6.3. Consider the random variable X of Example 1.6.1. Recall that X
was the flip number on which the first head appeared. Let Y be the number of flips
before the first head. Then Y = X − 1. In this case, the function g is $g(x) = x - 1$,
whose inverse is given by $g^{-1}(y) = y + 1$. The space of Y is $\mathcal{D}_Y = \{0,1,2,\ldots\}$. The
pmf of X is given by (1.6.1); hence, based on expression (1.6.5), the pmf of Y is

\[
p_Y(y) = p_X(y + 1) = \left( \frac{1}{2} \right)^{y+1}, \quad \text{for } y = 0, 1, 2, \ldots. \quad ■
\]

Example 1.6.4. Let X have the pmf

\[
p_X(x) = \begin{cases} \dfrac{3!}{x!(3-x)!} \left( \dfrac{2}{3} \right)^{x} \left( \dfrac{1}{3} \right)^{3-x} & x = 0,1,2,3 \\ 0 & \text{elsewhere.} \end{cases}
\]

We seek the pmf $p_Y(y)$ of the random variable $Y = X^2$. The transformation
$y = g(x) = x^2$ maps $\mathcal{D}_X = \{x : x = 0,1,2,3\}$ onto $\mathcal{D}_Y = \{y : y = 0,1,4,9\}$. In
general, $y = x^2$ does not define a one-to-one transformation; here, however, it does,
for there are no negative values of x in $\mathcal{D}_X = \{x : x = 0,1,2,3\}$. That is, we have
the single-valued inverse function $x = g^{-1}(y) = \sqrt{y}$ (not $-\sqrt{y}$), and so

\[
p_Y(y) = p_X(\sqrt{y}) = \frac{3!}{(\sqrt{y})!(3-\sqrt{y})!} \left( \frac{2}{3} \right)^{\sqrt{y}} \left( \frac{1}{3} \right)^{3-\sqrt{y}}, \quad y = 0,1,4,9. \quad ■
\]


The second case is where the transformation, g(x), is not one-to-one. Instead of 
developing an overall rule, for most applications involving discrete random variables 
the pmf of Y can be obtained in a straightforward manner. We offer two examples 
as illustrations. 

Consider the geometric random variable in Example 1.6.3. Suppose we are 
playing a game against the “house” (say, a gambling casino). If the first head appears 
on an odd number of flips, we pay the house one dollar, while if it appears on an 
even number of flips, we win one dollar from the house. Let Y denote our net gain. 
Then the space of Y is {−1,1}. In Example 1.6.1, we showed that the probability
that X is odd is 2/3. Hence, the distribution of Y is given by $p_Y(-1) = 2/3$ and
$p_Y(1) = 1/3$.

As a second illustration, let $Z = (X - 2)^2$, where X is the geometric random
variable of Example 1.6.1. Then the space of Z is $\mathcal{D}_Z = \{0,1,4,9,16,\ldots\}$. Note
that Z = 0 if and only if X = 2; Z = 1 if and only if X = 1 or X = 3; while for the
other values of the space there is a one-to-one correspondence given by $x = \sqrt{z} + 2$,
for $z \in \{4,9,16,\ldots\}$. Hence, the pmf of Z is

\[
p_Z(z) = \begin{cases} p_X(2) = \frac{1}{4} & \text{for } z = 0 \\ p_X(1) + p_X(3) = \frac{5}{8} & \text{for } z = 1 \\ p_X(\sqrt{z} + 2) = \frac{1}{4}\left( \frac{1}{2} \right)^{\sqrt{z}} & \text{for } z = 4, 9, 16, \ldots. \end{cases} \tag{1.6.6}
\]

For verification, the reader is asked to show in Exercise 1.6.11 that the pmf of Z
sums to 1 over its space.
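
A numerical check of expression (1.6.6) is straightforward: push the mass of X through the transformation and add the mass of all x values that land on the same z. The R sketch below is illustrative only (the truncation at x = 60 and the names p_X, x, z, and p_Z are choices made here), and it does not replace the argument asked for in Exercise 1.6.11.

```r
# pmf (1.6.1) of X, the flip number of the first head
p_X <- function(x) (1/2)^x

# Push the mass of X through z = (x - 2)^2; x = 1, ..., 60 carries all but
# a negligible part of the probability
x <- 1:60
z <- (x - 2)^2
p_Z <- tapply(p_X(x), z, sum)   # add the mass of all x mapped to the same z

p_Z[c("0", "1", "4", "9")]
sum(p_Z)
```

The first three entries should match 1/4, 5/8, and (1/4)(1/2)^2 = 1/16 from (1.6.6).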


EXERCISES 


1.6.1. Let X equal the number of heads in four independent flips of a coin. Using 
certain assumptions, determine the pmf of X and compute the probability that X 
is equal to an odd number. 


1.6.2. Let a bowl contain 10 chips of the same size and shape. One and only one 
of these chips is red. Continue to draw chips from the bowl, one at a time and at 
random and without replacement, until the red chip is drawn. 


(a) Find the pmf of X, the number of trials needed to draw the red chip. 
(b) Compute P(X < 4). 


1.6.3. Cast a die a number of independent times until a six appears on the up side 
of the die. 


(a) Find the pmf p(x) of X, the number of casts needed to obtain that first six. 
(b) Show that $\sum_{x=1}^{\infty} p(x) = 1$.
(c) Determine P(X = 1,3,5,7,...).
(d) Find the cdf $F(x) = P(X \le x)$.


1.6.4. Cast a die two independent times and let X equal the absolute value of the 
difference of the two resulting values (the numbers on the up sides). Find the pmf 
of X. Hint: It is not necessary to find a formula for the pmf. 


1.6.5. For the random variable X defined in Example 1.6.2: 


(a) Write an R function that returns the pmf. Note that in R, choose(m,k)
computes $\binom{m}{k}$.

(b) Write an R function that returns the graph of the cdf.
1.6.6. For the random variable X defined in Example 1.6.1, graph the cdf of X. 


1.6.7. Let X have a pmf $p(x) = 1/3$, x = 1,2,3, zero elsewhere. Find the pmf of
Y = 2X + 1.


1.6.8. Let X have the pmf $p(x) = (1/2)^x$, x = 1,2,3,..., zero elsewhere. Find the
pmf of $Y = X^3$.


1.6.9. Let X have the pmf p(x) = 1/3, x = −1,0,1. Find the pmf of $Y = X^2$.

1.6.10. Let X have the pmf

\[
p(x) = \left( \frac{1}{2} \right)^{|x|}, \quad x = -1, -2, -3, \ldots.
\]

Find the pmf of $Y = X^4$.


1.6.11. Show that the function given in expression (1.6.6) is a pmf. 


1.7 Continuous Random Variables 


In the last section, we discussed discrete random variables. Another class of random 
variables important in statistical applications is the class of continuous random 
variables, which we define next. 


Definition 1.7.1 (Continuous Random Variables). We say a random variable is a
continuous random variable if its cumulative distribution function $F_X(x)$ is a
continuous function for all $x \in R$.


Recall from Theorem 1.5.3 that $P(X = x) = F_X(x) - F_X(x-)$, for any random
variable X. Hence, for a continuous random variable X, there are no points of
discrete mass; i.e., if X is continuous, then P(X = x) = 0 for all $x \in R$. Most
continuous random variables are absolutely continuous; that is,

\[
F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt, \tag{1.7.1}
\]

for some function $f_X(t)$. The function $f_X(t)$ is called a probability density function
(pdf) of X. If $f_X(x)$ is also continuous, then the Fundamental Theorem of
Calculus implies that

\[
\frac{d}{dx} F_X(x) = f_X(x). \tag{1.7.2}
\]

The support of a continuous random variable X consists of all points x such
that $f_X(x) > 0$. As in the discrete case, we often denote the support of X by S.

If X is a continuous random variable, then probabilities can be obtained by
integration; i.e.,

\[
P(a < X \le b) = F_X(b) - F_X(a) = \int_a^b f_X(t)\,dt.
\]

Also, for continuous random variables,

\[
P(a < X \le b) = P(a \le X \le b) = P(a \le X < b) = P(a < X < b).
\]

From the definition (1.7.2), note that pdfs satisfy the two properties

\[
\text{(i) } f_X(x) \ge 0 \quad \text{and} \quad \text{(ii) } \int_{-\infty}^{\infty} f_X(t)\,dt = 1. \tag{1.7.3}
\]

The second property, of course, follows from $F_X(\infty) = 1$. In an advanced course in
probability, it is shown that if a function satisfies the above two properties, then it
is a pdf for a continuous random variable; see, for example, Tucker (1967).

Recall in Example 1.5.2 the simple experiment where a number was chosen
at random from the interval (0,1). The number chosen, X, is an example of a
continuous random variable. Recall that the cdf of X is $F_X(x) = x$, for $0 < x < 1$.
Hence, the pdf of X is given by

\[
f_X(x) = \begin{cases} 1 & 0 < x < 1 \\ 0 & \text{elsewhere.} \end{cases} \tag{1.7.4}
\]

Any continuous or discrete random variable X whose pdf or pmf is constant on
the support of X is said to have a uniform distribution; see Chapter 3 for a more
formal definition.


Example 1.7.1 (Point Chosen at Random Within the Unit Circle). Suppose we
select a point at random in the interior of a circle of radius 1. Let X be the
distance of the selected point from the origin. The sample space for the experiment
is $\mathcal{C} = \{(w,y) : w^2 + y^2 < 1\}$. Because the point is chosen at random, it seems
that subsets of $\mathcal{C}$ which have equal area are equilikely. Hence, the probability of the
selected point lying in a set $A \subset \mathcal{C}$ is proportional to the area of A; i.e.,

\[
P(A) = \frac{\text{area of } A}{\pi}.
\]

For $0 < x < 1$, the event $\{X \le x\}$ is equivalent to the point lying in a circle of
radius x. By this probability rule, $P(X \le x) = \pi x^2/\pi = x^2$; hence, the cdf of X is

\[
F_X(x) = \begin{cases} 0 & x < 0 \\ x^2 & 0 \le x < 1 \\ 1 & 1 \le x. \end{cases} \tag{1.7.5}
\]
Taking the derivative of $F_X(x)$, we obtain the pdf of X:

\[
f_X(x) = \begin{cases} 2x & 0 \le x < 1 \\ 0 & \text{elsewhere.} \end{cases} \tag{1.7.6}
\]

For illustration, the probability that the selected point falls in the ring with radii
1/4 and 1/2 is given by

\[
P\left( \frac{1}{4} < X \le \frac{1}{2} \right) = \int_{1/4}^{1/2} 2w\,dw = w^2 \Big|_{1/4}^{1/2} = \frac{3}{16}. \quad ■
\]
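
The value 3/16 can be checked by simulating the experiment of Example 1.7.1. The sketch below draws points uniformly on the enclosing square and keeps those inside the unit circle (a rejection step chosen here for simplicity; it is not a method given in the text). The seed, sample size, and variable names are arbitrary.

```r
set.seed(3)   # arbitrary seed
n <- 200000
# Draw points uniformly on the square (-1, 1) x (-1, 1) and keep those inside
# the unit circle; the kept points are uniform on the circle's interior
w <- runif(n, -1, 1)
y <- runif(n, -1, 1)
inside <- w^2 + y^2 < 1
x <- sqrt(w[inside]^2 + y[inside]^2)   # X = distance from the origin

# Empirical P(1/4 < X <= 1/2) against the exact value F(1/2) - F(1/4)
p_sim <- mean(x > 1/4 & x <= 1/2)
p_exact <- (1/2)^2 - (1/4)^2
c(p_sim, p_exact)
```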


Example 1.7.2. Let the random variable X be the time in seconds between incoming
telephone calls at a busy switchboard. Suppose that a reasonable probability model
for X is given by the pdf

\[
f_X(x) = \begin{cases} \frac{1}{4} e^{-x/4} & 0 < x < \infty \\ 0 & \text{elsewhere.} \end{cases}
\]

Note that $f_X$ satisfies the two properties of a pdf, namely, (i) $f_X(x) \ge 0$ and (ii)

\[
\int_0^{\infty} \frac{1}{4} e^{-x/4}\,dx = -e^{-x/4} \Big|_0^{\infty} = 1.
\]

For illustration, the probability that the time between successive phone calls exceeds
4 seconds is given by

\[
P(X > 4) = \int_4^{\infty} \frac{1}{4} e^{-x/4}\,dx = e^{-1} = 0.3679.
\]

The pdf and the probability of interest are depicted in Figure 1.7.1. From the figure,
the pdf has a long right tail and no left tail. We say that this distribution is skewed
right or positively skewed. This is an example of a gamma distribution which is
discussed in detail in Chapter 3. ■
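
The two integrals of Example 1.7.2 can be evaluated numerically in R. In this sketch the function name f and the result names are illustrative; the call to pexp relies on the fact that the pdf above is that of R's exponential distribution with rate 1/4.

```r
# pdf of Example 1.7.2: f(x) = (1/4) exp(-x/4) for x > 0
f <- function(x) ifelse(x > 0, exp(-x / 4) / 4, 0)

total <- integrate(f, lower = 0, upper = Inf)$value   # property (ii): should be 1
p_gt4 <- integrate(f, lower = 4, upper = Inf)$value   # P(X > 4)

# The same tail probability from R's exponential cdf with rate 1/4
p_builtin <- pexp(4, rate = 1/4, lower.tail = FALSE)
c(total, p_gt4, p_builtin)
```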


1.7.1 Quantiles 
Quantiles (percentiles) are easily interpretable characteristics of a distribution. 


Definition 1.7.2 (Quantile). Let $0 < p < 1$. The quantile of order p of the
distribution of a random variable X is a value $\xi_p$ such that $P(X < \xi_p) \le p$ and
$P(X \le \xi_p) \ge p$. It is also known as the (100p)th percentile of X. ■


Examples include the median, which is the quantile $\xi_{1/2}$. The median is also
called the second quartile. It is a point in the domain of X that divides the mass
of the pdf into its lower and upper halves. The first and third quartiles divide each
of these halves into quarters. They are, respectively, $\xi_{1/4}$ and $\xi_{3/4}$. We label these
quartiles as $q_1$, $q_2$, and $q_3$, respectively. The difference $iq = q_3 - q_1$ is called the
interquartile range of X. The median is often used as a measure of center of the
distribution of X, while the interquartile range is used as a measure of spread or
dispersion of the distribution of X.

Figure 1.7.1: In Example 1.7.2, the area under the pdf to the right of 4 is P(X > 4).

Quantiles need not be unique even for continuous random variables with pdfs.
For example, any point in the interval (2,3) serves as a median for the following
pdf:

\[
f(x) = \begin{cases} 3(1-x)(x-2) & 1 < x < 2 \\ 3(3-x)(x-4) & 3 < x < 4 \\ 0 & \text{elsewhere.} \end{cases} \tag{1.7.7}
\]

If, however, a quantile, say $\xi_p$, is in the support of an absolutely continuous random
variable X with cdf $F_X(x)$ then $\xi_p$ is the unique solution to the equation:

\[
\xi_p = F_X^{-1}(p), \tag{1.7.8}
\]

where $F_X^{-1}(u)$ is the inverse function of $F_X(x)$. The next example serves as an
illustration.


Example 1.7.3. Let X be a continuous random variable with pdf

\[
f(x) = \frac{e^{x}}{(1 + 5e^{x})^{1.2}}, \quad -\infty < x < \infty. \tag{1.7.9}
\]

This pdf is a member of the log F-family of distributions which is often used in the
modeling of the log of lifetime data. Note that X has the support space $(-\infty, \infty)$.
The cdf of X is

\[
F(x) = 1 - (1 + 5e^{x})^{-.2}, \quad -\infty < x < \infty,
\]

which is confirmed immediately by showing that $F'(x) = f(x)$. For the inverse of
the cdf, set $u = F(x)$ and solve for x. A few steps of algebra lead to

\[
F^{-1}(u) = \log\left\{ .2\left[ (1-u)^{-5} - 1 \right] \right\}, \quad 0 < u < 1.
\]

Thus, $\xi_p = F_X^{-1}(p) = \log\{ .2[(1-p)^{-5} - 1] \}$. The following three R functions can
be used to compute the pdf, cdf, and inverse cdf of F, respectively. These can be
downloaded at the site listed in the Preface.


dlogF <- function(x){exp(x)/(1+5*exp(x))^(1.2)}   # pdf (1.7.9)
plogF <- function(x){1- (1+5*exp(x))^(-.2)}       # cdf
qlogF <- function(x){log(.2*((1-x)^(-5) - 1))}    # inverse cdf (quantile function)


Once the R function qlogF is sourced, it can be used to compute quantiles. The
following is an R script which results in the computation of the three quartiles of
X:


qlogF(.25) ; qlogF(.50); qlogF(.75)
-0.4419242; 1.824549; 5.321057


Figure 1.7.2 displays a plot of this pdf and its quartiles. Notice that this is another
example of a skewed-right distribution; i.e., the right tail is much longer than the left
tail. In terms of the log-lifetime of mechanical parts having this distribution, it
follows that 50% of the parts survive beyond 1.82 log-units and 25% of the parts
live longer than 5.32 log-units. With the long right tail, some parts attain a long
life. ■
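
As a check on Example 1.7.3, the three functions can be tested against one another. The sketch below restates them so that it runs on its own; the checks (a round trip through the cdf, a numerical root via uniroot, and a numerical integral of the pdf) and the names p, q, med_root, and mid_mass are illustrative additions, not part of the text.

```r
# The three functions of Example 1.7.3, restated so this sketch is self-contained
dlogF <- function(x) exp(x) / (1 + 5 * exp(x))^1.2
plogF <- function(x) 1 - (1 + 5 * exp(x))^(-0.2)
qlogF <- function(p) log(0.2 * ((1 - p)^(-5) - 1))

p <- c(0.25, 0.50, 0.75)
q <- qlogF(p)

# (1) The cdf evaluated at the quantiles should return p
plogF(q)

# (2) Solving F(x) = 1/2 numerically should reproduce the median
med_root <- uniroot(function(x) plogF(x) - 0.5, interval = c(-10, 10))$root

# (3) Integrating the pdf from q1 to q3 should give probability 1/2
mid_mass <- integrate(dlogF, lower = q[1], upper = q[3])$value
c(med_root, mid_mass)
```

The numerical root in step (2) is only accurate to the default tolerance of uniroot, so small differences from qlogF(.50) are expected.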


1.7.2 Transformations 


Let X be a continuous random variable with a known pdf fx. As in the discrete 
case, we are often interested in the distribution of a random variable Y which is 
some transformation of X, say, Y = g(X). Often we can obtain the pdf of Y by 
first obtaining its cdf. We illustrate this with two examples. 


Example 1.7.4. Let X be the random variable in Example 1.7.1. Recall that X
was the distance from the origin to the random point selected in the unit circle.
Suppose instead that we are interested in the square of the distance; that is, let
$Y = X^2$. The support of Y is the same as that of X, namely, $S_Y = (0,1)$. What is
the cdf of Y? By expression (1.7.5), the cdf of X is

\[
F_X(x) = \begin{cases} 0 & x < 0 \\ x^2 & 0 \le x < 1 \\ 1 & 1 \le x. \end{cases} \tag{1.7.10}
\]

Let y be in the support of Y; i.e., $0 < y < 1$. Then, using expression (1.7.10) and
the fact that the support of X contains only positive numbers, the cdf of Y is

\[
F_Y(y) = P(Y \le y) = P(X^2 \le y) = P(X \le \sqrt{y}) = F_X(\sqrt{y}) = (\sqrt{y})^2 = y.
\]

It follows that the pdf of Y is

\[
f_Y(y) = \begin{cases} 1 & 0 < y < 1 \\ 0 & \text{elsewhere.} \end{cases} \quad ■
\]

Figure 1.7.2: A graph of the pdf (1.7.9) showing the three quartiles, $q_1$, $q_2$, and
$q_3$, of the distribution. The probability mass in each of the four sections is 1/4.


Example 1.7.5. Let $f_X(x) = 1/2$, $-1 < x < 1$, zero elsewhere, be the pdf of a
random variable X. Note that X has a uniform distribution with the interval of
support (−1,1). Define the random variable Y by $Y = X^2$. We wish to find the
pdf of Y. If $y \ge 0$, the probability $P(Y \le y)$ is equivalent to

\[
P(X^2 \le y) = P(-\sqrt{y} \le X \le \sqrt{y}).
\]

Accordingly, the cdf of Y, $F_Y(y) = P(Y \le y)$, is given by

\[
F_Y(y) = \begin{cases} 0 & y < 0 \\ \int_{-\sqrt{y}}^{\sqrt{y}} \frac{1}{2}\,dx = \sqrt{y} & 0 \le y < 1 \\ 1 & 1 \le y. \end{cases}
\]

Hence, the pdf of Y is given by

\[
f_Y(y) = \begin{cases} \dfrac{1}{2\sqrt{y}} & 0 < y < 1 \\ 0 & \text{elsewhere.} \end{cases} \quad ■
\]
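
The cdf obtained in Example 1.7.5 can be compared with a simulated sample of $Y = X^2$. This R sketch is illustrative: the seed, the sample size, the evaluation grid, and the variable names are arbitrary choices.

```r
set.seed(4)   # arbitrary seed
x <- runif(100000, min = -1, max = 1)   # X uniform on (-1, 1)
y <- x^2                                # Y = X^2

# Compare the empirical cdf of Y with the derived cdf F_Y(y) = sqrt(y)
y_grid <- c(0.01, 0.1, 0.25, 0.5, 0.9)
F_hat <- ecdf(y)(y_grid)
round(rbind(y_grid, F_hat, F_exact = sqrt(y_grid)), 3)

# The derived pdf 1 / (2 sqrt(y)) should integrate to 1 over (0, 1)
total <- integrate(function(y) 1 / (2 * sqrt(y)), lower = 0, upper = 1)$value
total
```

Note that the pdf is unbounded near y = 0, yet it still integrates to 1.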


These examples illustrate the cumulative distribution function technique.
The transformation in Example 1.7.4 is one-to-one, and in such cases we can obtain
a simple formula for the pdf of Y in terms of the pdf of X, which we record in the
next theorem.


Theorem 1.7.1. Let X be a continuous random variable with pdf $f_X(x)$ and support
$S_X$. Let Y = g(X), where g(x) is a one-to-one differentiable function, on the support
of X, $S_X$. Denote the inverse of g by $x = g^{-1}(y)$ and let $dx/dy = d[g^{-1}(y)]/dy$.
Then the pdf of Y is given by

\[
f_Y(y) = f_X(g^{-1}(y)) \left| \frac{dx}{dy} \right|, \quad \text{for } y \in S_Y, \tag{1.7.11}
\]

where the support of Y is the set $S_Y = \{y = g(x) : x \in S_X\}$.


Proof: Since g(x) is one-to-one and continuous, it is either strictly monotonically
increasing or decreasing. Assume that it is strictly monotonically increasing, for
now. The cdf of Y is given by

\[
F_Y(y) = P[Y \le y] = P[g(X) \le y] = P[X \le g^{-1}(y)] = F_X(g^{-1}(y)). \tag{1.7.12}
\]

Hence, the pdf of Y is

\[
f_Y(y) = \frac{d}{dy} F_Y(y) = f_X(g^{-1}(y)) \frac{dx}{dy}, \tag{1.7.13}
\]

where dx/dy is the derivative of the function $x = g^{-1}(y)$. In this case, because g is
increasing, dx/dy > 0. Hence, we can write dx/dy = |dx/dy|.

Suppose g(x) is strictly monotonically decreasing. Then (1.7.12) becomes $F_Y(y) = 1 - F_X(g^{-1}(y))$.
Hence, the pdf of Y is $f_Y(y) = f_X(g^{-1}(y))(-dx/dy)$. But since g
is decreasing, dx/dy < 0 and, hence, −dx/dy = |dx/dy|. Thus Equation (1.7.11) is
true in both cases.⁵ ■


Henceforth, we refer to $dx/dy = (d/dy)g^{-1}(y)$ as the Jacobian (denoted by J)
of the transformation. In most mathematical areas, J = dx/dy is referred to as the
Jacobian of the inverse transformation $x = g^{-1}(y)$, but in this book it is called the
Jacobian of the transformation, simply for convenience.

We summarize Theorem 1.7.1 in a simple algorithm which we illustrate in the
next example. Assuming that the transformation Y = g(X) is one-to-one, the
following steps lead to the pdf of Y:


1. Find the support of Y.


2. Solve for the inverse of the transformation; i.e., solve for x in terms of y in
y = g(x), thereby obtaining $x = g^{-1}(y)$.


3. Obtain $\frac{dx}{dy}$.


4. The pdf of Y is $f_Y(y) = f_X(g^{-1}(y)) \left| \frac{dx}{dy} \right|$.


⁵The proof of Theorem 1.7.1 can also be obtained by using the change-of-variable technique as
discussed in Chapter 4 of Mathematical Comments.
Example 1.7.6. Let X have the pdf

\[
f_X(x) = \begin{cases} 4x^3 & 0 < x < 1 \\ 0 & \text{elsewhere.} \end{cases}
\]

Consider the random variable $Y = -\log X$. Here are the steps of the above algorithm:

1. The support of $Y = -\log X$ is $(0, \infty)$.
2. If $y = -\log x$ then $x = e^{-y}$.
3. $\frac{dx}{dy} = -e^{-y}$.
4. Thus the pdf of Y is:

\[
f_Y(y) = f_X(e^{-y}) \left| -e^{-y} \right| = 4(e^{-y})^3 e^{-y} = 4e^{-4y}. \quad ■
\]

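
The four steps of the algorithm translate directly into R. The following sketch applies them to Example 1.7.6; the function names f_X, g_inv, dx_dy, and f_Y are hypothetical names chosen for illustration.

```r
f_X <- function(x) ifelse(x > 0 & x < 1, 4 * x^3, 0)   # pdf of X

g_inv <- function(y) exp(-y)        # step 2: x = g^{-1}(y)
dx_dy <- function(y) -exp(-y)       # step 3: derivative of the inverse

# Step 4: f_Y(y) = f_X(g^{-1}(y)) |dx/dy| on the support (0, Inf) from step 1
f_Y <- function(y) ifelse(y > 0, f_X(g_inv(y)) * abs(dx_dy(y)), 0)

y <- c(0.1, 0.5, 1, 2)
cbind(y, algorithm = f_Y(y), closed_form = 4 * exp(-4 * y))

total <- integrate(f_Y, lower = 0, upper = Inf)$value   # should be 1
total
```

The two columns should agree, and the resulting pdf should integrate to 1.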
1.7.3 Mixtures of Discrete and Continuous Type Distributions


We close this section with two examples of distributions that are not of the discrete
or the continuous type.


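Example 1.7.7. Let a distribution function be given by

\[
F(x) = \begin{cases} 0 & x < 0 \\ \frac{x+1}{2} & 0 \le x < 1 \\ 1 & 1 \le x. \end{cases}
\]

Then, for instance,

\[
P\left( -3 < X \le \frac{1}{2} \right) = F\left( \frac{1}{2} \right) - F(-3) = \frac{3}{4} - 0 = \frac{3}{4}
\]

and

\[
P(X = 0) = F(0) - F(0-) = \frac{1}{2} - 0 = \frac{1}{2}.
\]

The graph of F(x) is shown in Figure 1.7.3. We see that F(x) is not always
continuous, nor is it a step function. Accordingly, the corresponding distribution is
neither of the continuous type nor of the discrete type. It may be described as a
mixture of those types. ■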


Distributions that are mixtures of the continuous and discrete type do, in fact, 
occur frequently in practice. For illustration, in life testing, suppose we know that 
the length of life, say X, exceeds the number b, but the exact value of X is unknown. 
This is called censoring. For instance, this can happen when a subject in a cancer 
study simply disappears; the investigator knows that the subject has lived a certain 
number of months, but the exact length of life is unknown. Or it might happen 
when an investigator does not have enough time in an investigation to observe the 
moments of deaths of all the animals, say rats, in some study. Censoring can also 
occur in the insurance industry; in particular, consider a loss with a limited-pay 
policy in which the top amount is exceeded but it is not known by how much. 
Figure 1.7.3: Graph of the cdf of Example 1.7.7.


Example 1.7.8. Reinsurance companies are concerned with large losses because
they might agree, for illustration, to cover losses due to wind damages that are
between $2,000,000 and $10,000,000. Say that X equals the size of a wind loss in
millions of dollars, and suppose it has the cdf

\[
F_X(x) = \begin{cases} 0 & -\infty < x < 0 \\ 1 - \left( \dfrac{10}{10+x} \right)^3 & 0 \le x < \infty. \end{cases}
\]

If losses beyond $10,000,000 are reported only as 10, then the cdf of this censored
distribution is

\[
F_Y(y) = \begin{cases} 0 & -\infty < y < 0 \\ 1 - \left( \dfrac{10}{10+y} \right)^3 & 0 \le y < 10 \\ 1 & 10 \le y < \infty, \end{cases}
\]

which has a jump of $[10/(10 + 10)]^3 = 1/8$ at $y = 10$. ■
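
The jump of the censored cdf can be seen in a simulation. The sketch below generates losses by inverting the cdf of X (solving $u = F_X(x)$ for x, a simulation device assumed here rather than taken from the text) and then censors them at 10; the seed, sample size, and names are arbitrary.

```r
# cdf of the wind loss X (in millions): F(x) = 1 - (10 / (10 + x))^3 for x >= 0
F_X <- function(x) ifelse(x < 0, 0, 1 - (10 / (10 + x))^3)

# Invert u = F(x) to simulate X: x = 10 * ((1 - u)^(-1/3) - 1)
set.seed(5)   # arbitrary seed
u <- runif(100000)
x <- 10 * ((1 - u)^(-1/3) - 1)

y <- pmin(x, 10)                 # losses beyond 10 are reported as 10
jump_exact <- 1 - F_X(10)        # P(X >= 10) = (10 / 20)^3
jump_sim <- mean(y == 10)        # proportion of censored observations
c(jump_exact, jump_sim)
```

About one-eighth of the simulated losses should be reported as exactly 10, which is the discrete part of this mixed distribution.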


EXERCISES 


1.7.1. Let a point be selected from the sample space $\mathcal{C} = \{c : 0 < c < 10\}$. Let
$C \subset \mathcal{C}$ and let the probability set function be $P(C) = \int_C \frac{1}{10}\,dz$. Define the random
variable X to be $X(c) = c^2$. Find the cdf and the pdf of X.


1.7.2. Let the space of the random variable X be $\mathcal{C} = \{x : 0 < x < 10\}$ and
let $P_X(C_1) = 3/8$, where $C_1 = \{x : 1 < x < 5\}$. Show that $P_X(C_2) \le 5/8$, where
$C_2 = \{x : 5 \le x < 10\}$.

1.7.3. Let the subsets $C_1 = \{1/4 < x < 1/2\}$ and $C_2 = \{1/2 \le x < 1\}$ of the space
$\mathcal{C} = \{x : 0 < x < 1\}$ of the random variable X be such that $P_X(C_1) = 1/8$ and
$P_X(C_2) = 1/2$. Find $P_X(C_1 \cup C_2)$, $P_X(C_1^c)$, and $P_X(C_1^c \cap C_2^c)$.
1.7.4. Given $\int_C [1/\pi(1 + x^2)]\,dx$, where $C \subset \mathcal{C} = \{x : -\infty < x < \infty\}$. Show that
the integral could serve as a probability set function of a random variable X whose
space is $\mathcal{C}$.


1.7.5. Let the probability set function of the random variable X be

\[
P_X(C) = \int_C e^{-x}\,dx, \quad \text{where } \mathcal{C} = \{x : 0 < x < \infty\}.
\]

Let $C_k = \{x : 2 - 1/k < x \le 3\}$, k = 1,2,3,.... Find the limits $\lim_{k \to \infty} C_k$ and
$P_X(\lim_{k \to \infty} C_k)$. Find $P_X(C_k)$ and show that $\lim_{k \to \infty} P_X(C_k) = P_X(\lim_{k \to \infty} C_k)$.

1.7.6. For each of the following pdfs of X, find $P(|X| < 1)$ and $P(X^2 < 9)$.

(a) $f(x) = x^2/18$, $-3 < x < 3$, zero elsewhere.

(b) $f(x) = (x + 2)/18$, $-2 < x < 4$, zero elsewhere.


1.7.7. Let $f(x) = 1/x^2$, $1 < x < \infty$, zero elsewhere, be the pdf of X. If $C_1 = \{x : 1 < x < 2\}$
and $C_2 = \{x : 4 < x < 5\}$, find $P_X(C_1 \cup C_2)$ and $P_X(C_1 \cap C_2)$.


1.7.8. A mode of the distribution of a random variable X is a value of x that
maximizes the pdf or pmf. If there is only one such x, it is called the mode of the
distribution. Find the mode of each of the following distributions:


(a) $p(x) = (1/2)^x$, x = 1,2,3,..., zero elsewhere.


(b) $f(x) = 12x^2(1 - x)$, $0 < x < 1$, zero elsewhere.
(c) $f(x) = (1/2)x^2 e^{-x}$, $0 < x < \infty$, zero elsewhere.


1.7.9. The median and quantiles, in general, are discussed in Section 1.7.1. Find 
the median of each of the following distributions: 


a) p(x) = aan (5)"(4)**, 2 = 0,1,2,3,4, zero elsewhere. 
aaa a) a 

(b) f(x) = 327, 0 <a <1, zero elsewhere. 

(c) f(®) = aque OO < B< OW. 


1.7.10. Let 0< p< 1. Find the 0.20 quantile (20th percentile) of the distribution 
that has pdf f(x) = 4x3, 0 < x < 1, zero elsewhere. 


1.7.11. For each of the following cdfs \(F(x)\), find the pdf \(f(x)\) [pmf in part (d)],
the first quartile, and the 0.60 quantile. Also, sketch the graphs of \(f(x)\) and \(F(x)\).
You may use R to obtain the graphs. For part (a) the code is provided.

(a) \(F(x) = \frac{1}{2} + \frac{1}{\pi}\tan^{-1}(x)\), \(-\infty < x < \infty\).

x<-seq(-5,5,.01); y<-.5+atan(x)/pi; y2<-1/(pi*(1+x^2))
par(mfrow=c(1,2)); plot(y~x); plot(y2~x)

(b) \(F(x) = \exp\{-e^{-x}\}\), \(-\infty < x < \infty\).

(c) \(F(x) = (1 + e^{-x})^{-1}\), \(-\infty < x < \infty\).

(d) \(F(x) = \sum_{j=1}^{x} (\frac{1}{2})^j\).

1.7.12. Find the cdf \(F(x)\) associated with each of the following probability density
functions. Sketch the graphs of \(f(x)\) and \(F(x)\).

(a) \(f(x) = 3(1 - x)^2\), \(0 < x < 1\), zero elsewhere.

(b) \(f(x) = 1/x^2\), \(1 < x < \infty\), zero elsewhere.

(c) \(f(x) = \frac{1}{3}\), \(0 < x < 1\) or \(2 < x < 4\), zero elsewhere.

Also, find the median and the 25th percentile of each of these distributions.


1.7.13. Consider the cdf \(F(x) = 1 - e^{-x} - xe^{-x}\), \(0 \le x < \infty\), zero elsewhere. Find
the pdf, the mode, and the median (by numerical methods) of this distribution.


1.7.14. Let \(X\) have the pdf \(f(x) = 2x\), \(0 < x < 1\), zero elsewhere. Compute the
probability that \(X\) is at least \(\frac{3}{4}\) given that \(X\) is at least \(\frac{1}{2}\).


1.7.15. The random variable \(X\) is said to be stochastically larger than the
random variable \(Y\) if

\[
P(X > z) \ge P(Y > z), \tag{1.7.14}
\]

for all real \(z\), with strict inequality holding for at least one \(z\) value. Show that this
requires that the cdfs enjoy the following property:

\[
F_X(z) \le F_Y(z),
\]

for all real \(z\), with strict inequality holding for at least one \(z\) value.


1.7.16. Let \(X\) be a continuous random variable with support \((-\infty, \infty)\). Consider
the random variable \(Y = X + \Delta\), where \(\Delta > 0\). Using the definition in Exercise
1.7.15, show that \(Y\) is stochastically larger than \(X\).


1.7.17. Divide a line segment into two parts by selecting a point at random. Find 
the probability that the length of the larger segment is at least three times the 
length of the shorter segment. Assume a uniform distribution. 


1.7.18. Let \(X\) be the number of gallons of ice cream that is requested at a certain
store on a hot summer day. Assume that \(f(x) = 12x(1000 - x)^2/10^{12}\), \(0 < x < 1000\),
zero elsewhere, is the pdf of \(X\). How many gallons of ice cream should the store
have on hand each of these days, so that the probability of exhausting its supply
on a particular day is 0.05?


1.7.19. Find the 25th percentile of the distribution having pdf \(f(x) = |x|/4\), where
\(-2 < x < 2\) and zero elsewhere.




1.7.20. The distribution of the random variable X in Example 1.7.3 is often used 
to model the log of the lifetime of a mechanical or electrical part. What about the 
lifetime itself? Let Y = exp{X}. 


(a) Determine the range of Y. 
(b) Use the transformation technique to find the pdf of Y. 


(c) Write an R function to compute this pdf and use it to obtain a graph of the 
pdf. Discuss the plot. 


(d) Determine the 90th percentile of Y. 


1.7.21. The distribution of the random variable \(X\) in Example 1.7.3 is a member
of the log-\(F\) family. Another member has the cdf

\[
F(x) = \left[1 + \frac{2}{5}e^{-x}\right]^{-5/2}, \quad -\infty < x < \infty.
\]


(a) Determine the corresponding pdf. 


(b) Write an R function that computes this cdf. Plot the function and obtain 
approximations of the quartiles and median by inspection of the plot. 


(c) Obtain the inverse of the cdf and confirm the percentiles in Part (b). 


1.7.22. Let \(X\) have the pdf \(f(x) = x^2/9\), \(0 < x < 3\), zero elsewhere. Find the pdf
of \(Y = X^3\).


1.7.23. If the pdf of \(X\) is \(f(x) = 2xe^{-x^2}\), \(0 < x < \infty\), zero elsewhere, determine
the pdf of \(Y = X^2\).


1.7.24. Let \(X\) have the uniform pdf \(f_X(x) = \frac{1}{\pi}\), for \(-\frac{\pi}{2} < x < \frac{\pi}{2}\). Find the pdf of
\(Y = \tan X\). This is the pdf of a Cauchy distribution.


1.7.25. Let \(X\) have the pdf \(f(x) = 4x^3\), \(0 < x < 1\), zero elsewhere. Find the cdf
and the pdf of \(Y = -\ln X^4\).


1.7.26. Let \(f(x) = \frac{1}{3}\), \(-1 < x < 2\), zero elsewhere, be the pdf of \(X\). Find the cdf
and the pdf of \(Y = X^2\).
Hint: Consider \(P(X^2 \le y)\) for two cases: \(0 \le y < 1\) and \(1 \le y < 4\).


1.8 Expectation of a Random Variable 
In this section we introduce the expectation operator, which we use throughout
the remainder of the text. For the definition, recall from calculus that absolute
convergence of sums or integrals implies their convergence.

Definition 1.8.1 (Expectation). Let \(X\) be a random variable. If \(X\) is a continuous
random variable with pdf \(f(x)\) and

\[
\int_{-\infty}^{\infty} |x| f(x)\,dx < \infty,
\]

then the expectation of \(X\) is

\[
E(X) = \int_{-\infty}^{\infty} x f(x)\,dx.
\]

If \(X\) is a discrete random variable with pmf \(p(x)\) and

\[
\sum_x |x|\,p(x) < \infty,
\]

then the expectation of \(X\) is

\[
E(X) = \sum_x x\,p(x).
\]


Sometimes the expectation \(E(X)\) is called the mathematical expectation of
\(X\), the expected value of \(X\), or the mean of \(X\). When the mean designation is
used, we often denote \(E(X)\) by \(\mu\); i.e., \(\mu = E(X)\).


Example 1.8.1 (Expectation of a Constant). Consider a constant random variable,
that is, a random variable with all its mass at a constant k. This is a discrete random
variable with pmf p(k) = 1. We have by definition that


E(k) =kp(k) =k. (1.8.1) 


Example 1.8.2. Let the random variable \(X\) of the discrete type have the pmf given
by the table

    x       1      2      3      4
    p(x)   4/10   1/10   3/10   2/10

Here \(p(x) = 0\) if \(x\) is not equal to one of the first four positive integers. This
illustrates the fact that there is no need to have a formula to describe a pmf. We
have

\[
E(X) = (1)\left(\tfrac{4}{10}\right) + (2)\left(\tfrac{1}{10}\right) + (3)\left(\tfrac{3}{10}\right) + (4)\left(\tfrac{2}{10}\right) = \tfrac{23}{10} = 2.3. \;■
\]


Example 1.8.3. Let the continuous random variable \(X\) have the pdf

\[
f(x) = \begin{cases} 4x^3 & 0 < x < 1 \\ 0 & \text{elsewhere.} \end{cases}
\]

Then

\[
E(X) = \int_0^1 x(4x^3)\,dx = \int_0^1 4x^4\,dx = \left[\frac{4x^5}{5}\right]_0^1 = \frac{4}{5}. \;■
\]

Remark 1.8.1. The terminology of expectation or expected value has its origin
in games of chance. For example, consider a game involving a spinner with the 
numbers 1, 2, 3 and 4 on it. Suppose the corresponding probabilities of spinning 
these numbers are 0.20, 0.30, 0.35, and 0.15. To begin a game, a player pays $5 to 
the “house” to play. The spinner is then spun and the player “wins” the amount in 
the second line of the table:

    Number spun x      1       2       3       4
    "Wins"            $2      $3      $4     $12
    G = Gain         -$3     -$2     -$1      $7
    p_G(x)           0.20    0.30    0.35    0.15

"Wins" is in quotes, since the player must pay $5 to play. Of course, the random
variable of interest is the gain to the player; i.e., G with the range as given in the 
third row of the table. Notice that 20% of the time the player gains —$3; 30% of 
the time the player gains —$2; 35% of the time the player gains —$1; and 15% of 
the time the player gains $7. In mathematics this sentence is expressed as 


(—3) x 0.20 + (—2) x 0.30 + (—1) x 0.35 + 7 x 0.15 = —0.50, 


which, of course, is E(G). That is, the expected gain to the player in this game is 
—$0.50. So the player expects to lose 50 cents per play. We say a game is a fair 
game, if the expected gain is 0. So this spinner game is not a fair game. ■
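
The arithmetic in this remark is a one-line computation in R. The sketch below uses the gains and probabilities stated in the remark; the payouts in `wins` are not quoted from the text but are recovered as gain plus the $5 fee, and the simulation size and seed are arbitrary choices of ours.

```r
prob <- c(0.20, 0.30, 0.35, 0.15)   # probabilities of spinning 1, 2, 3, 4
wins <- c(2, 3, 4, 12)              # payouts, recovered as gain + 5 (our assumption)
gain <- wins - 5                    # the gains -3, -2, -1, 7 of the remark

# expected gain: weight each gain by its probability and add
expected_gain <- sum(gain * prob)
expected_gain

# a long run of simulated plays should average close to the expected gain
set.seed(20)
sim_gain <- sample(gain, 100000, replace = TRUE, prob = prob)
mean(sim_gain)
```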


Let us consider a function of a random variable X. Call this function Y = g(X). 
Because Y is a random variable, we could obtain its expectation by first finding 
the distribution of Y. However, as the following theorem states, we can use the 
distribution of X to determine the expectation of Y. 


Theorem 1.8.1. Let \(X\) be a random variable and let \(Y = g(X)\) for some function
\(g\).

(a) Suppose \(X\) is continuous with pdf \(f_X(x)\). If \(\int_{-\infty}^{\infty} |g(x)| f_X(x)\,dx < \infty\), then the
expectation of \(Y\) exists and it is given by

\[
E(Y) = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx. \tag{1.8.2}
\]

(b) Suppose \(X\) is discrete with pmf \(p_X(x)\). Suppose the support of \(X\) is denoted
by \(S_X\). If \(\sum_{x \in S_X} |g(x)| p_X(x) < \infty\), then the expectation of \(Y\) exists and it is
given by

\[
E(Y) = \sum_{x \in S_X} g(x) p_X(x). \tag{1.8.3}
\]


Proof: We give the proof in the discrete case. The proof for the continuous case
requires some advanced results in analysis; see, also, Exercise 1.8.1.
Because \(\sum_{x \in S_X} |g(x)| p_X(x)\) converges, it follows by a theorem in calculus⁶ that
any rearrangement of the terms of the series converges to the same limit. Thus we
have,

\[
\begin{aligned}
\sum_{x \in S_X} |g(x)| p_X(x) &= \sum_{y \in S_Y} \; \sum_{\{x \in S_X : g(x) = y\}} |g(x)| p_X(x) && (1.8.4) \\
&= \sum_{y \in S_Y} |y| \sum_{\{x \in S_X : g(x) = y\}} p_X(x) && (1.8.5) \\
&= \sum_{y \in S_Y} |y| p_Y(y), && (1.8.6)
\end{aligned}
\]

where \(S_Y\) denotes the support of \(Y\). So \(E(Y)\) exists; i.e., \(\sum_{x \in S_X} g(x) p_X(x)\) converges.
Because \(\sum_{x \in S_X} g(x) p_X(x)\) converges and also converges absolutely, the
same theorem from calculus can be used to show that the above equations (1.8.4)–
(1.8.6) hold without the absolute values. Hence, \(E(Y) = \sum_{x \in S_X} g(x) p_X(x)\), which
is the desired result. ■
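
The regrouping used in the proof can be watched on a small case. The following sketch takes a hypothetical pmf on four points and the function g(x) = x^2, which is not one-to-one; both are our own choices for illustration. It computes E(Y) once from the pmf of X, as in (1.8.3), and once by first collecting the probabilities of all x with g(x) = y into the pmf of Y.

```r
x  <- c(-1, 0, 1, 2)            # hypothetical support of X
px <- c(0.1, 0.2, 0.3, 0.4)     # hypothetical pmf of X
g  <- function(x) x^2           # Y = g(X); the points -1 and 1 both map to y = 1

# Route 1: Theorem 1.8.1, sum g(x) p_X(x) over the support of X
e1 <- sum(g(x) * px)

# Route 2: build p_Y(y) by adding p_X(x) over {x : g(x) = y}, then sum y p_Y(y)
py <- tapply(px, g(x), sum)     # entries are labeled by the values of y
y  <- as.numeric(names(py))
e2 <- sum(y * py)

c(route1 = e1, route2 = e2)
```

Both routes give the same number, since the second only rearranges the terms of the first.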

The following two examples illustrate this theorem. 


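Example 1.8.4. Let \(Y\) be the discrete random variable discussed in Example 1.6.3
and let \(Z = e^{-Y}\). Since \((2e)^{-1} < 1\), we have by Theorem 1.8.1 that

\[
E[Z] = E\left[e^{-Y}\right] = \sum_{y=0}^{\infty} e^{-y}\left(\frac{1}{2}\right)^{y+1}
= e \sum_{y=0}^{\infty} \left(\frac{1}{2e}\right)^{y+1} = \frac{e}{1 - (1/(2e))} \cdot \frac{1}{2e} = \frac{e}{2e - 1}. \;■
\]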


Example 1.8.5. Let \(X\) be a continuous random variable with the pdf \(f(x) = 2x\),
which has support on the interval \((0, 1)\). Suppose \(Y = 1/(1 + X)\). Then by Theorem
1.8.1, we have

\[
E(Y) = \int_0^1 \frac{2x}{1 + x}\,dx = \int_1^2 \frac{2u - 2}{u}\,du = 2(1 - \log 2),
\]

where we have used the change in variable \(u = 1 + x\) in the second integral. ■
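
As a numerical check of this example, the integral in Theorem 1.8.1 can be evaluated with the R function integrate. This is only a sketch; the names g and f below are ours.

```r
g <- function(x) 1 / (1 + x)    # the transformation Y = g(X)
f <- function(x) 2 * x          # pdf of X on (0, 1)

# E(Y) as the integral of g(x) f(x) over the support of X
numeric_value <- integrate(function(x) g(x) * f(x), lower = 0, upper = 1)$value
exact_value   <- 2 * (1 - log(2))

c(numeric = numeric_value, exact = exact_value)
```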

Theorem 1.8.2 shows that the expectation operator \(E\) is a linear operator.


Theorem 1.8.2. Let \(g_1(X)\) and \(g_2(X)\) be functions of a random variable \(X\). Suppose
the expectations of \(g_1(X)\) and \(g_2(X)\) exist. Then for any constants \(k_1\) and \(k_2\),
the expectation of \(k_1 g_1(X) + k_2 g_2(X)\) exists and it is given by

\[
E[k_1 g_1(X) + k_2 g_2(X)] = k_1 E[g_1(X)] + k_2 E[g_2(X)]. \tag{1.8.7}
\]


Proof: For the continuous case, existence follows from the hypothesis, the triangle
inequality, and the linearity of the integral; i.e.,

\[
\int_{-\infty}^{\infty} |k_1 g_1(x) + k_2 g_2(x)| f_X(x)\,dx \le |k_1| \int_{-\infty}^{\infty} |g_1(x)| f_X(x)\,dx
+ |k_2| \int_{-\infty}^{\infty} |g_2(x)| f_X(x)\,dx < \infty.
\]

The result (1.8.7) follows similarly using the linearity of the integral. The proof for
the discrete case follows likewise using the linearity of sums. ■

⁶ For example, see Chapter 2 on infinite series in Mathematical Comments, referenced in the
Preface.


The following examples illustrate these theorems. 


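Example 1.8.6. Let \(X\) have the pdf

\[
f(x) = \begin{cases} 2(1 - x) & 0 < x < 1 \\ 0 & \text{elsewhere.} \end{cases}
\]

Then

\[
E(X) = \int_{-\infty}^{\infty} x f(x)\,dx = \int_0^1 (x)\,2(1 - x)\,dx = \frac{1}{3},
\]

\[
E(X^2) = \int_{-\infty}^{\infty} x^2 f(x)\,dx = \int_0^1 (x^2)\,2(1 - x)\,dx = \frac{1}{6},
\]

and, of course,

\[
E(6X + 3X^2) = 6\left(\frac{1}{3}\right) + 3\left(\frac{1}{6}\right) = \frac{5}{2}. \;■
\]

Example 1.8.7. Let \(X\) have the pmf

\[
p(x) = \begin{cases} \dfrac{x}{6} & x = 1, 2, 3 \\ 0 & \text{elsewhere.} \end{cases}
\]

Then

\[
E(6X^3 + X) = 6E(X^3) + E(X) = 6\sum_{x=1}^{3} x^3 p(x) + \sum_{x=1}^{3} x\,p(x) = \frac{301}{3}. \;■
\]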


Example 1.8.8. Let us divide, at random, a horizontal line segment of length 5
into two parts. If \(X\) is the length of the left-hand part, it is reasonable to assume
that \(X\) has the pdf

\[
f(x) = \begin{cases} \dfrac{1}{5} & 0 < x < 5 \\ 0 & \text{elsewhere.} \end{cases}
\]

The expected value of the length of \(X\) is \(E(X) = \frac{5}{2}\) and the expected value of the
length \(5 - X\) is \(E(5 - X) = \frac{5}{2}\). But the expected value of the product of the two
lengths is equal to

\[
E[X(5 - X)] = \int_0^5 x(5 - x)\left(\tfrac{1}{5}\right)dx = \frac{25}{6} \ne \left(\frac{5}{2}\right)^2.
\]

That is, in general, the expected value of a product is not equal to the product of
the expected values.
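
The three expectations of Example 1.8.8 can be reproduced numerically. The sketch below uses the R functions integrate and dunif for the uniform pdf on (0, 5); the variable names are ours.

```r
# uniform pdf on (0, 5), the model for the break point X
f <- function(x) dunif(x, min = 0, max = 5)

e_left  <- integrate(function(x) x * f(x), 0, 5)$value            # E(X)
e_right <- integrate(function(x) (5 - x) * f(x), 0, 5)$value      # E(5 - X)
e_prod  <- integrate(function(x) x * (5 - x) * f(x), 0, 5)$value  # E[X(5 - X)]

# the product of the expectations differs from the expectation of the product
c(product_of_means = e_left * e_right, mean_of_product = e_prod)
```

Linearity still holds for the sum of the two lengths: e_left + e_right equals 5, the length of the segment.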
1.8.1 R Computation for an Estimation of the Expected Gain


In the following example, we use an R function to estimate the expected gain in a 
simple game. 


Example 1.8.9. Consider the following game. A player pays \(p_0\) to play. He then
rolls a fair 6-sided die with the numbers 1 through 6 on it. If the upface is a 1 or a
2, then the game is over. Otherwise, he flips a fair coin. If the coin toss results in a
tail, he receives $1 and the game is over. If, on the other hand, the coin toss results
in a head, he draws 2 cards without replacement from a standard deck of 52 cards.
If none of the cards is an ace, he receives $2, while he receives $10 or $50 if he gets 1
or 2 aces, respectively. In both cases, the game is over. Let \(G\) denote the player's
gain. To determine the expected gain, we need the distribution of \(G\). The support
of \(G\) is the set \(\{-p_0, 1 - p_0, 2 - p_0, 10 - p_0, 50 - p_0\}\). For the associated probabilities
we need the distribution of \(X\), where \(X\) is the number of aces in a draw of 2 cards
from a standard deck of 52 cards without replacement. This is another example of
the hypergeometric distribution discussed in Example 1.6.2. For our situation, the
distribution is

\[
P(X = x) = \frac{\binom{4}{x}\binom{48}{2-x}}{\binom{52}{2}}, \quad x = 0, 1, 2.
\]


Using this formula, the probabilities of X, to 4 places, are 0.8507, 0.1448, and 0.0045 
for x equal to 0, 1, and 2, respectively. Using these probabilities and independence, 
the distribution and expected value of G can be determined; see Exercise 1.8.13. 
Suppose, however, a person does not have this expertise. Such a person would 
observe the game a number of times and then use the average of the observed gains 
as his/her estimate of E(G). We will show in Chapter 2 that this estimate, in 
a probability sense, is close to E(G), as the number of times the game is played 
increases. To compute this estimation, we use the following R function, simplegame, 
which plays the game and returns the gain. This function can be downloaded at the 
site given in the Preface. The argument of the function is the amount the player 
pays to play. Also, the third line of the function computes the distribution of the 
above random variable X. To draw from a discrete distribution, the code makes 
use of the R function sample which was discussed previously in Example 1.4.12. 


simplegame <- function(amtpaid){
   gain <- -amtpaid                        # the player first pays to play
   x <- 0:2; pace <- (choose(4,x)*choose(48,2-x))/choose(52,2)   # P(0, 1, 2 aces)
   x <- sample(1:6,1,prob=rep(1/6,6))      # roll the die
   if(x > 2){                              # an upface of 1 or 2 ends the game
      y <- sample(0:1,1,prob=rep(1/2,2))   # flip the coin; 0 denotes a tail
      if(y==0){
         gain <- gain + 1
      } else {
         z <- sample(0:2,1,prob=pace)      # number of aces in the 2 cards
         if(z==0){gain <- gain + 2}
         if(z==1){gain <- gain + 10}
         if(z==2){gain <- gain + 50}
      }
   }
   return(gain)
}


The following R script obtains the average gain for a sample of 10,000 games. For 
the example, we set the amount the player pays at $5. 


amtpaid <- 5; numtimes <- 10000; gains <- c()
for(i in 1:numtimes){gains <- c(gains,simplegame(amtpaid))}
mean(gains)


When we ran this script, we obtained −3.5446 as our estimate of E(G). Exercise
1.8.13 shows that E(G) = −3.54. ■
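
For comparison with the simulated average, the exact expectation can also be computed directly. The following is one possible sketch, not the function used in the text: it lists the five payoffs with their probabilities, using independence of the die, the coin, and the cards, and the names payoff and prob are ours.

```r
amtpaid <- 5
x <- 0:2
pace <- choose(4, x) * choose(48, 2 - x) / choose(52, 2)   # P(X = 0), P(X = 1), P(X = 2)

# payoffs: die shows 1 or 2; tail; head with 0, 1, or 2 aces
payoff <- c(0, 1, 2, 10, 50)
prob   <- c(1/3, (2/3) * (1/2), (2/3) * (1/2) * pace)

sum(prob)                                   # the five probabilities add to 1
expected_gain <- sum(payoff * prob) - amtpaid
round(expected_gain, 2)
```

Rounded to two places this gives −3.54, which agrees with the value quoted from Exercise 1.8.13 and sits close to the simulated average.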


EXERCISES 


1.8.1. Our proof of Theorem 1.8.1 was for the discrete case. The proof for the
continuous case requires some advanced results in analysis. If, in addition, though,
the function \(g(x)\) is one-to-one, show that the result is true for the continuous case.
Hint: First assume that \(y = g(x)\) is strictly increasing. Then use the change-of-
variable technique with Jacobian \(dx/dy\) on the integral \(\int_{x \in S_X} g(x) f_X(x)\,dx\).


1.8.2. Consider the random variable X in Example 1.8.5. As in the example, let 
Y =1/(1+X). In the example we found the E(Y) by using Theorem 1.8.1. Verify 
this result by finding the pdf of Y and use it to obtain the E(Y). 


1.8.3. Let \(X\) have the pdf \(f(x) = (x + 2)/18\), \(-2 < x < 4\), zero elsewhere. Find
\(E(X)\), \(E[(X + 2)^3]\), and \(E[6X - 2(X + 2)^3]\).


1.8.4. Suppose that \(p(x) = \frac{1}{5}\), \(x = 1, 2, 3, 4, 5\), zero elsewhere, is the pmf of the
discrete-type random variable \(X\). Compute \(E(X)\) and \(E(X^2)\). Use these two results
to find \(E[(X + 2)^2]\) by writing \((X + 2)^2 = X^2 + 4X + 4\).


1.8.5. Let \(X\) be a number selected at random from a set of numbers \(\{51, 52, \ldots, 100\}\).
Approximate \(E(1/X)\).
Hint: Find reasonable upper and lower bounds by finding integrals bounding \(E(1/X)\).


1.8.6. Let the pmf \(p(x)\) be positive at \(x = -1, 0, 1\) and zero elsewhere.
(a) If \(p(0) = \frac{1}{4}\), find \(E(X^2)\).
(b) If \(p(0) = \frac{1}{4}\) and if \(E(X) = \frac{1}{4}\), determine \(p(-1)\) and \(p(1)\).


1.8.7. Let \(X\) have the pdf \(f(x) = 3x^2\), \(0 < x < 1\), zero elsewhere. Consider a
random rectangle whose sides are \(X\) and \((1 - X)\). Determine the expected value of
the area of the rectangle.


1.8.8. A bowl contains 10 chips, of which 8 are marked $2 each and 2 are marked
$5 each. Let a person choose, at random and without replacement, three chips from 
this bowl. If the person is to receive the sum of the resulting amounts, find his 
expectation. 


1.8.9. Let \(f(x) = 2x\), \(0 < x < 1\), zero elsewhere, be the pdf of \(X\).
(a) Compute \(E(1/X)\).
(b) Find the cdf and the pdf of \(Y = 1/X\).
(c) Compute \(E(Y)\) and compare this result with the answer obtained in part (a).


1.8.10. Two distinct integers are chosen at random and without replacement from 
the first six positive integers. Compute the expected value of the absolute value of 
the difference of these two numbers. 


1.8.11. Let \(X\) have a Cauchy distribution which has the pdf

\[
f(x) = \frac{1}{\pi}\,\frac{1}{x^2 + 1}, \quad -\infty < x < \infty. \tag{1.8.8}
\]

Then \(X\) is symmetrically distributed about 0 (why?). Why isn't \(E(X) = 0\)?

1.8.12. Let \(X\) have the pdf \(f(x) = 3x^2\), \(0 < x < 1\), zero elsewhere.

(a) Compute \(E(X^3)\).

(b) Show that \(Y = X^3\) has a uniform(0, 1) distribution.

(c) Compute \(E(Y)\) and compare this result with the answer obtained in part (a).


1.8.13. Using the probabilities discussed in Example 1.8.9 and independence,
determine the distribution of the random variable \(G\), the gain to a player of the game
when he pays \(p_0\) dollars to play. Show that \(E(G) = -\$3.54\) if the player pays $5 to
play.


1.8.14. A bowl contains five chips, which cannot be distinguished by a sense of
touch alone. Three of the chips are marked $1 each and the remaining two are
marked $4 each. A player is blindfolded and draws, at random and without
replacement, two chips from the bowl. The player is paid an amount equal to the sum of
the values of the two chips that he draws and the game is over. Suppose it costs
\(p_0\) dollars to play the game. Let the random variable \(G\) be the gain to a player of
the game. Determine the distribution of \(G\) and \(E(G)\). Determine \(p_0\) so that
the game is fair. The R code sample(c(1,1,1,4,4),2) computes a sample for this
game. Expand this into an R function that simulates the game.

1.9 Some Special Expectations


Certain expectations, if they exist, have special names and symbols to represent
them. First, let \(X\) be a random variable of the discrete type with pmf \(p(x)\). Then

\[
E(X) = \sum_x x\,p(x).
\]

If the support of \(X\) is \(\{a_1, a_2, a_3, \ldots\}\), it follows that

\[
E(X) = a_1 p(a_1) + a_2 p(a_2) + a_3 p(a_3) + \cdots.
\]

This sum of products is seen to be a "weighted average" of the values of \(a_1, a_2, a_3, \ldots\),
the "weight" associated with each \(a_i\) being \(p(a_i)\). This suggests that we call \(E(X)\)
the arithmetic mean of the values of \(X\), or, more simply, the mean value of \(X\) (or
the mean value of the distribution).


Definition 1.9.1 (Mean). Let \(X\) be a random variable whose expectation exists.
The mean value \(\mu\) of \(X\) is defined to be \(\mu = E(X)\).


The mean is the first moment (about 0) of a random variable. Another special
expectation involves the second moment. Let \(X\) be a discrete random variable with
support \(\{a_1, a_2, \ldots\}\) and with pmf \(p(x)\), then

\[
E[(X - \mu)^2] = \sum_x (x - \mu)^2 p(x) = (a_1 - \mu)^2 p(a_1) + (a_2 - \mu)^2 p(a_2) + \cdots.
\]

This sum of products may be interpreted as a "weighted average" of the squares of
the deviations of the numbers \(a_1, a_2, \ldots\) from the mean value \(\mu\) of those numbers
where the "weight" associated with each \((a_i - \mu)^2\) is \(p(a_i)\). It can also be thought of
as the second moment of \(X\) about \(\mu\). This is an important expectation for all types
of random variables, and we usually refer to it as the variance of \(X\).


Definition 1.9.2 (Variance). Let \(X\) be a random variable with finite mean \(\mu\) and
such that \(E[(X - \mu)^2]\) is finite. Then the variance of \(X\) is defined to be \(E[(X - \mu)^2]\).
It is usually denoted by \(\sigma^2\) or by Var(\(X\)).


It is worthwhile to observe that Var(\(X\)) equals

\[
\sigma^2 = E[(X - \mu)^2] = E(X^2 - 2\mu X + \mu^2).
\]

Because \(E\) is a linear operator it then follows that

\[
\begin{aligned}
\sigma^2 &= E(X^2) - 2\mu E(X) + \mu^2 \\
&= E(X^2) - 2\mu^2 + \mu^2 \\
&= E(X^2) - \mu^2.
\end{aligned}
\]

This frequently affords an easier way of computing the variance of X.
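
The two routes to the variance are easy to compare in R. The sketch below uses an illustrative pmf, p(x) = x/10 on x = 1, 2, 3, 4, chosen by us only for the arithmetic; the variable names are also ours.

```r
x <- 1:4
p <- x / 10                          # illustrative pmf; the probabilities sum to 1

mu <- sum(x * p)                     # mean, E(X)

var_def      <- sum((x - mu)^2 * p)  # definition: E[(X - mu)^2]
var_shortcut <- sum(x^2 * p) - mu^2  # shortcut:   E(X^2) - mu^2

c(mean = mu, by_definition = var_def, by_shortcut = var_shortcut)
```

The shortcut needs only the first two moments about 0, which is why it is often the quicker computation by hand.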
It is customary to call \(\sigma\) (the positive square root of the variance) the standard
deviation of \(X\) (or the standard deviation of the distribution). The number \(\sigma\)
is sometimes interpreted as a measure of the dispersion of the points of the space
relative to the mean value \(\mu\). If the space contains only one point \(k\) for which
\(p(k) > 0\), then \(p(k) = 1\), \(\mu = k\), and \(\sigma = 0\).

While the variance is not a linear operator, it does satisfy the following result:


Theorem 1.9.1. Let \(X\) be a random variable with finite mean \(\mu\) and variance
\(\sigma^2\). Then for all constants \(a\) and \(b\),

\[
\text{Var}(aX + b) = a^2\,\text{Var}(X). \tag{1.9.1}
\]

Proof. Because \(E\) is linear, \(E(aX + b) = a\mu + b\). Hence, by definition

\[
\text{Var}(aX + b) = E\{[(aX + b) - (a\mu + b)]^2\} = E\{a^2[X - \mu]^2\} = a^2\,\text{Var}(X). \;■
\]

Based on this theorem, for standard deviations, \(\sigma_{aX+b} = |a|\sigma_X\). The following
example illustrates these points.


Example 1.9.1. Suppose the random variable \(X\) has a uniform distribution, (1.7.4),
with pdf \(f_X(x) = 1/(2a)\), \(-a < x < a\), zero elsewhere. Then the mean and variance
of \(X\) are:

\[
\begin{aligned}
\mu &= \int_{-a}^{a} x\,\frac{1}{2a}\,dx = 0, \\
\sigma^2 &= \int_{-a}^{a} x^2\,\frac{1}{2a}\,dx - 0^2 = \frac{1}{2a}\left[\frac{x^3}{3}\right]_{-a}^{a} = \frac{a^2}{3},
\end{aligned}
\]

so that \(\sigma_X = a/\sqrt{3}\) is the standard deviation of the distribution of \(X\). Consider
the transformation \(Y = 2X\). Because the inverse transformation is \(x = y/2\) and
\(dx/dy = 1/2\), it follows from Theorem 1.7.1 that the pdf of \(Y\) is \(f_Y(y) = 1/(4a)\),
\(-2a < y < 2a\), zero elsewhere. Based on the above discussion, \(\sigma_Y = (2a)/\sqrt{3}\).
Hence, the standard deviation of \(Y\) is twice that of \(X\), reflecting the fact that the
probability for \(Y\) is spread out twice as much (relative to the mean zero) as the
probability for \(X\). ■
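
A quick numerical check of this example, as a sketch only: we pick the hypothetical value a = 3 and integrate the second moments of X and Y = 2X against their uniform pdfs. Both means are zero, so the second moments are the variances.

```r
a <- 3   # hypothetical half-width of the support, chosen only for illustration

# variances of X on (-a, a) and of Y = 2X on (-2a, 2a); both means are 0
var_x <- integrate(function(x) x^2 * dunif(x, -a, a), -a, a)$value
var_y <- integrate(function(y) y^2 * dunif(y, -2 * a, 2 * a), -2 * a, 2 * a)$value

sd_x <- sqrt(var_x)
sd_y <- sqrt(var_y)
c(sd_x = sd_x, sd_y = sd_y, ratio = sd_y / sd_x)
```

The ratio of the two standard deviations is 2, the absolute value of the multiplier in Y = 2X, as Theorem 1.9.1 requires.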


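Example 1.9.2. Let \(X\) have the pdf

\[
f(x) = \begin{cases} \frac{1}{2}(x + 1) & -1 < x < 1 \\ 0 & \text{elsewhere.} \end{cases}
\]

Then the mean value of \(X\) is

\[
\mu = \int_{-\infty}^{\infty} x f(x)\,dx = \int_{-1}^{1} x\,\frac{x + 1}{2}\,dx = \frac{1}{3},
\]

while the variance of \(X\) is

\[
\sigma^2 = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \mu^2 = \int_{-1}^{1} x^2\,\frac{x + 1}{2}\,dx - \left(\frac{1}{3}\right)^2 = \frac{2}{9}. \;■
\]

Example 1.9.3. If \(X\) has the pdf

\[
f(x) = \begin{cases} \dfrac{1}{x^2} & 1 < x < \infty \\ 0 & \text{elsewhere,} \end{cases}
\]

then the mean value of \(X\) does not exist, because

\[
\int_1^{\infty} |x|\,\frac{1}{x^2}\,dx = \lim_{b \to \infty} \int_1^b \frac{1}{x}\,dx = \lim_{b \to \infty} (\log b - \log 1) = \infty,
\]

which is not finite. ■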
We next define a third special expectation. 


Definition 1.9.3 (Moment Generating Function). Let \(X\) be a random variable
such that for some \(h > 0\), the expectation of \(e^{tX}\) exists for \(-h < t < h\). The
moment generating function of \(X\) is defined to be the function \(M(t) = E(e^{tX})\),
for \(-h < t < h\). We use the abbreviation mgf to denote the moment generating
function of a random variable.


Actually, all that is needed is that the mgf exists in an open neighborhood of 0.
Such an interval, of course, includes an interval of the form \((-h, h)\) for some \(h > 0\).
Further, it is evident that if we set \(t = 0\), we have \(M(0) = 1\). But note that for an
mgf to exist, it must exist in an open interval about 0.


Example 1.9.4. Suppose we have a fair spinner with the numbers 1, 2, and 3 on
it. Let \(X\) be the number of spins until the first 3 occurs. Assuming that the spins
are independent, the pmf of \(X\) is

\[
p(x) = \frac{1}{3}\left(\frac{2}{3}\right)^{x-1}, \quad x = 1, 2, 3, \ldots.
\]

Then, using the geometric series, the mgf of \(X\) is

\[
M(t) = E(e^{tX}) = \sum_{x=1}^{\infty} e^{tx}\,\frac{1}{3}\left(\frac{2}{3}\right)^{x-1}
= \frac{1}{3}e^t \sum_{x=1}^{\infty} \left(e^t\,\frac{2}{3}\right)^{x-1}
= \frac{1}{3}e^t \left(1 - e^t\,\frac{2}{3}\right)^{-1},
\]

provided that \(e^t(2/3) < 1\); i.e., \(t < \log(3/2)\). This last interval contains an open
interval about 0; hence, the mgf of \(X\) exists and is given in the final line of the above
derivation. ■
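
The closed form can be compared with a partial sum of the defining series. In this sketch the function names mgf_closed and mgf_series, the truncation at 500 terms, and the evaluation point t = 0.2 (which lies below log(3/2)) are our own choices.

```r
# closed form from the geometric series, valid for t < log(3/2)
mgf_closed <- function(t) (1/3) * exp(t) / (1 - (2/3) * exp(t))

# partial sum of E(e^{tX}) = sum over x of e^{tx} p(x), with p(x) = (1/3)(2/3)^(x-1)
mgf_series <- function(t, nterms = 500) {
  x <- 1:nterms
  sum(exp(t * x) * (1/3) * (2/3)^(x - 1))
}

t0 <- 0.2
c(limit = log(3/2), closed = mgf_closed(t0), series = mgf_series(t0))
mgf_closed(0)   # an mgf always equals 1 at t = 0
```

For t at or above log(3/2) the terms of the series no longer shrink, so the partial sums grow without bound and the closed form is no longer meaningful.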


If we are discussing several random variables, it is often useful to subscript \(M\)
as \(M_X\) to denote that this is the mgf of \(X\).

Let \(X\) and \(Y\) be two random variables with mgfs. If \(X\) and \(Y\) have the same
distribution, i.e., \(F_X(z) = F_Y(z)\) for all \(z\), then certainly \(M_X(t) = M_Y(t)\) in a
neighborhood of 0. But one of the most important properties of mgfs is that the
converse of this statement is true too. That is, mgfs uniquely identify distributions.
We state this as a theorem. The proof of this converse, though, is beyond the scope
of this text; see Chung (1974). We verify it for a discrete situation.

Theorem 1.9.2. Let \(X\) and \(Y\) be random variables with moment generating
functions \(M_X\) and \(M_Y\), respectively, existing in open intervals about 0. Then \(F_X(z) =
F_Y(z)\) for all \(z \in R\) if and only if \(M_X(t) = M_Y(t)\) for all \(t \in (-h, h)\) for some
\(h > 0\).


Because of the importance of this theorem, it does seem desirable to try to make
the assertion plausible. This can be done if the random variable is of the discrete
type. For example, let it be given that

\[
M(t) = \tfrac{1}{10}e^{t} + \tfrac{2}{10}e^{2t} + \tfrac{3}{10}e^{3t} + \tfrac{4}{10}e^{4t}
\]

is, for all real values of \(t\), the mgf of a random variable \(X\) of the discrete type. If
we let \(p(x)\) be the pmf of \(X\) with support \(\{a_1, a_2, a_3, \ldots\}\), then because

\[
M(t) = \sum_x e^{tx} p(x),
\]

we have

\[
\tfrac{1}{10}e^{t} + \tfrac{2}{10}e^{2t} + \tfrac{3}{10}e^{3t} + \tfrac{4}{10}e^{4t} = p(a_1)e^{a_1 t} + p(a_2)e^{a_2 t} + \cdots.
\]

Because this is an identity for all real values of \(t\), it seems that the right-hand
member should consist of but four terms and that each of the four should be equal,
respectively, to one of those in the left-hand member; hence we may take \(a_1 = 1\),
\(p(a_1) = \frac{1}{10}\); \(a_2 = 2\), \(p(a_2) = \frac{2}{10}\); \(a_3 = 3\), \(p(a_3) = \frac{3}{10}\); \(a_4 = 4\), \(p(a_4) = \frac{4}{10}\). Or, more
simply, the pmf of \(X\) is

\[
p(x) = \begin{cases} \dfrac{x}{10} & x = 1, 2, 3, 4 \\ 0 & \text{elsewhere.} \end{cases}
\]

On the other hand, suppose \(X\) is a random variable of the continuous type. Let
it be given that

\[
M(t) = \frac{1}{1 - t}, \quad t < 1,
\]

is the mgf of \(X\). That is, we are given

\[
\frac{1}{1 - t} = \int_{-\infty}^{\infty} e^{tx} f(x)\,dx, \quad t < 1.
\]

It is not at all obvious how \(f(x)\) is found. However, it is easy to see that a
distribution with pdf

\[
f(x) = \begin{cases} e^{-x} & 0 < x < \infty \\ 0 & \text{elsewhere} \end{cases}
\]

has the mgf \(M(t) = (1 - t)^{-1}\), \(t < 1\). Thus the random variable \(X\) has a distribution
with this pdf in accordance with the assertion of the uniqueness of the mgf.

Since a distribution that has an mgf M(t) is completely determined by M(t), 
it would not be surprising if we could obtain some properties of the distribution 
directly from M(t). For example, the existence of M(t) for —h < t < h implies that 
derivatives of \(M(t)\) of all orders exist at \(t = 0\). Also, a theorem in analysis allows
us to interchange the order of differentiation and integration (or summation in the
discrete case). That is, if \(X\) is continuous,

\[
M'(t) = \frac{dM(t)}{dt} = \frac{d}{dt}\int_{-\infty}^{\infty} e^{tx} f(x)\,dx
= \int_{-\infty}^{\infty} \frac{d}{dt}\,e^{tx} f(x)\,dx = \int_{-\infty}^{\infty} x e^{tx} f(x)\,dx.
\]

Likewise, if \(X\) is a discrete random variable,

\[
M'(t) = \frac{dM(t)}{dt} = \sum_x x e^{tx} p(x).
\]

Upon setting \(t = 0\), we have in either case

\[
M'(0) = E(X) = \mu.
\]

The second derivative of \(M(t)\) is

\[
M''(t) = \int_{-\infty}^{\infty} x^2 e^{tx} f(x)\,dx \quad \text{or} \quad \sum_x x^2 e^{tx} p(x),
\]

so that \(M''(0) = E(X^2)\). Accordingly, Var(\(X\)) equals

\[
\sigma^2 = E(X^2) - \mu^2 = M''(0) - [M'(0)]^2.
\]

For example, if \(M(t) = (1 - t)^{-1}\), \(t < 1\), as in the illustration above, then

\[
M'(t) = (1 - t)^{-2} \quad \text{and} \quad M''(t) = 2(1 - t)^{-3}.
\]

Hence

\[
\mu = M'(0) = 1
\]

and

\[
\sigma^2 = M''(0) - \mu^2 = 2 - 1 = 1.
\]

Of course, we could have computed \(\mu\) and \(\sigma^2\) from the pdf by

\[
\mu = \int_{-\infty}^{\infty} x f(x)\,dx \quad \text{and} \quad \sigma^2 = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \mu^2,
\]

respectively. Sometimes one way is easier than the other.
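
Both ways can be carried out in R for the mgf M(t) = 1/(1 − t). The sketch below differentiates M(t) symbolically with the R function D and then, for comparison, integrates against the pdf e^{-x} on (0, ∞); the variable names are ours.

```r
# Way 1: differentiate the mgf M(t) = 1/(1 - t) and evaluate at t = 0
M  <- quote(1 / (1 - t))
M1 <- D(M, "t")                      # first derivative of M(t)
M2 <- D(M1, "t")                     # second derivative of M(t)
mu     <- eval(M1, list(t = 0))      # M'(0)  = E(X)
m2     <- eval(M2, list(t = 0))      # M''(0) = E(X^2)
sigma2 <- m2 - mu^2

# Way 2: integrate against the pdf f(x) = exp(-x), 0 < x < infinity
mu_pdf     <- integrate(function(x) x * exp(-x), 0, Inf)$value
sigma2_pdf <- integrate(function(x) x^2 * exp(-x), 0, Inf)$value - mu_pdf^2

c(mu = mu, sigma2 = sigma2, mu_pdf = mu_pdf, sigma2_pdf = sigma2_pdf)
```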
In general, if \(m\) is a positive integer and if \(M^{(m)}(t)\) means the \(m\)th derivative of
\(M(t)\), we have, by repeated differentiation with respect to \(t\),

\[
M^{(m)}(0) = E(X^m).
\]

Now

\[
E(X^m) = \int_{-\infty}^{\infty} x^m f(x)\,dx \quad \text{or} \quad \sum_x x^m p(x),
\]

and the integrals (or sums) of this sort are, in mechanics, called moments. Since
\(M(t)\) generates the values of \(E(X^m)\), \(m = 1, 2, 3, \ldots\), it is called the moment-
generating function (mgf). In fact, we sometimes call \(E(X^m)\) the \(m\)th moment of
the distribution, or the \(m\)th moment of \(X\).

The next two examples concern random variables whose distributions do not
have mgfs.

Example 1.9.5. It is known that the series

\[
\frac{1}{1^2} + \frac{1}{2^2} + \frac{1}{3^2} + \cdots
\]

converges to \(\pi^2/6\). Then

\[
p(x) = \begin{cases} \dfrac{6}{\pi^2 x^2} & x = 1, 2, 3, \ldots \\ 0 & \text{elsewhere} \end{cases}
\]

is the pmf of a discrete type of random variable \(X\). The mgf of this distribution, if
it exists, is given by

\[
M(t) = E(e^{tX}) = \sum_x e^{tx} p(x) = \sum_{x=1}^{\infty} \frac{6e^{tx}}{\pi^2 x^2}.
\]

The ratio test of calculus⁷ may be used to show that this series diverges if \(t > 0\).
Thus there does not exist a positive number \(h\) such that \(M(t)\) exists for \(-h < t < h\).
Accordingly, the distribution that has the pmf \(p(x)\) of this example does not have
an mgf. ■


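Example 1.9.6. Let \(X\) be a continuous random variable with pdf

\[
f(x) = \frac{1}{\pi}\,\frac{1}{x^2 + 1}, \quad -\infty < x < \infty. \tag{1.9.2}
\]

This is of course the Cauchy pdf which was introduced in Exercise 1.7.24. Let \(t > 0\)
be given. If \(x > 0\), then by the mean value theorem, for some \(0 < \xi_0 < tx\),

\[
\frac{e^{tx} - 1}{tx} = e^{\xi_0} \ge 1.
\]

Hence, \(e^{tx} \ge 1 + tx \ge tx\). This leads to the second inequality in the following
derivation:

\[
\begin{aligned}
\int_{-\infty}^{\infty} e^{tx}\,\frac{1}{\pi}\,\frac{1}{x^2 + 1}\,dx &\ge \int_0^{\infty} e^{tx}\,\frac{1}{\pi}\,\frac{1}{x^2 + 1}\,dx \\
&\ge \int_0^{\infty} \frac{1}{\pi}\,\frac{tx}{x^2 + 1}\,dx = \infty.
\end{aligned}
\]

Because \(t\) was arbitrary, the integral does not exist in an open interval of 0. Hence,
the mgf of the Cauchy distribution does not exist. ■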


Example 1.9.7. Let \(X\) have the mgf \(M(t) = e^{t^2/2}\), \(-\infty < t < \infty\). As discussed in
Chapter 3, this is the mgf of a standard normal distribution. We can differentiate
\(M(t)\) any number of times to find the moments of \(X\). However, it is instructive to
consider this alternative method. The function \(M(t)\) is represented by the following
Maclaurin's series:⁸

\[
\begin{aligned}
e^{t^2/2} &= 1 + \frac{1}{1!}\left(\frac{t^2}{2}\right) + \frac{1}{2!}\left(\frac{t^2}{2}\right)^2 + \cdots + \frac{1}{k!}\left(\frac{t^2}{2}\right)^k + \cdots \\
&= 1 + \frac{1}{2!}t^2 + \frac{(3)(1)}{4!}t^4 + \cdots + \frac{(2k - 1)\cdots(3)(1)}{(2k)!}t^{2k} + \cdots.
\end{aligned}
\]

In general, though, from calculus the Maclaurin's series for \(M(t)\) is

\[
\begin{aligned}
M(t) &= M(0) + \frac{M'(0)}{1!}t + \frac{M''(0)}{2!}t^2 + \cdots + \frac{M^{(m)}(0)}{m!}t^m + \cdots \\
&= 1 + \frac{E(X)}{1!}t + \frac{E(X^2)}{2!}t^2 + \cdots + \frac{E(X^m)}{m!}t^m + \cdots.
\end{aligned}
\]

Thus the coefficient of \((t^m/m!)\) in the Maclaurin's series representation of \(M(t)\) is
\(E(X^m)\). So, for our particular \(M(t)\), we have

\[
E(X^{2k}) = (2k - 1)(2k - 3)\cdots(3)(1) = \frac{(2k)!}{2^k k!}, \quad k = 1, 2, 3, \ldots \tag{1.9.3}
\]

\[
E(X^{2k-1}) = 0, \quad k = 1, 2, 3, \ldots \tag{1.9.4}
\]

We make use of this result in Section 3.4. ■

⁷ For example, see Chapter 2 of Mathematical Comments.
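
Formula (1.9.3) can be checked against numerical integration, since the example identifies the distribution as standard normal and R supplies its pdf as dnorm. The following is a sketch for k = 1, ..., 4; the names formula_moments and numeric_moments are ours.

```r
k <- 1:4

# even moments from (1.9.3): (2k)! / (2^k k!)
formula_moments <- factorial(2 * k) / (2^k * factorial(k))

# the same moments by integrating x^(2k) against the standard normal pdf
numeric_moments <- sapply(k, function(kk)
  integrate(function(x) x^(2 * kk) * dnorm(x), -Inf, Inf)$value)

rbind(formula = formula_moments, numeric = numeric_moments)

# an odd moment, which (1.9.4) says is zero
odd_moment <- integrate(function(x) x^3 * dnorm(x), -Inf, Inf)$value
```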


Remark 1.9.1. As Examples 1.9.5 and 1.9.6 show, distributions may not have
moment-generating functions. In a more advanced course, we would let \(i\) denote
the imaginary unit, \(t\) an arbitrary real, and we would define \(\varphi(t) = E(e^{itX})\). This
expectation exists for every distribution and it is called the characteristic
function of the distribution. To see why \(\varphi(t)\) exists for all real \(t\), we note, in the
continuous case, that its absolute value

\[
|\varphi(t)| = \left|\int_{-\infty}^{\infty} e^{itx} f(x)\,dx\right| \le \int_{-\infty}^{\infty} |e^{itx} f(x)|\,dx.
\]

However, \(|f(x)| = f(x)\) since \(f(x)\) is nonnegative and

\[
|e^{itx}| = |\cos tx + i \sin tx| = \sqrt{\cos^2 tx + \sin^2 tx} = 1.
\]

Thus

\[
|\varphi(t)| \le \int_{-\infty}^{\infty} f(x)\,dx = 1.
\]

Accordingly, the integral for \(\varphi(t)\) exists for all real values of \(t\). In the discrete
case, a summation would replace the integral. In reference to Example 1.9.6, it can
be shown that the characteristic function of the Cauchy distribution is given by
\(\varphi(t) = \exp\{-|t|\}\), \(-\infty < t < \infty\).

Every distribution has a unique characteristic function; and to each
characteristic function there corresponds a unique distribution of probability. If \(X\) has
a distribution with characteristic function \(\varphi(t)\), then, for instance, if \(E(X)\) and
\(E(X^2)\) exist, they are given, respectively, by \(iE(X) = \varphi'(0)\) and \(i^2E(X^2) = \varphi''(0)\).
Readers who are familiar with complex-valued functions may write \(\varphi(t) = M(it)\)
and, throughout this book, may prove certain theorems in complete generality.

Those who have studied Laplace and Fourier transforms note a similarity
between these transforms and \(M(t)\) and \(\varphi(t)\); it is the uniqueness of these transforms
that allows us to assert the uniqueness of each of the moment-generating and
characteristic functions.

⁸ See Chapter 2 of Mathematical Comments.


EXERCISES 


1.9.1. Find the mean and variance, if they exist, of each of the following
distributions.

(a) \(p(x) = \frac{3!}{x!(3-x)!}(\frac{1}{2})^3\), \(x = 0, 1, 2, 3\), zero elsewhere.

(b) \(f(x) = 6x(1 - x)\), \(0 < x < 1\), zero elsewhere.

(c) \(f(x) = 2/x^3\), \(1 < x < \infty\), zero elsewhere.


1.9.2. Let \(p(x) = (\frac{1}{2})^x\), \(x = 1, 2, 3, \ldots\), zero elsewhere, be the pmf of the random
variable \(X\). Find the mgf, the mean, and the variance of \(X\).

1.9.3. For each of the following distributions, compute \(P(\mu - 2\sigma < X < \mu + 2\sigma)\).

(a) \(f(x) = 6x(1 - x)\), \(0 < x < 1\), zero elsewhere.

(b) \(p(x) = (\frac{1}{2})^x\), \(x = 1, 2, 3, \ldots\), zero elsewhere.


1.9.4. If the variance of the random variable \(X\) exists, show that

\[
E(X^2) \ge [E(X)]^2.
\]


1.9.5. Let a random variable \(X\) of the continuous type have a pdf \(f(x)\) whose
graph is symmetric with respect to \(x = c\). If the mean value of \(X\) exists, show that
\(E(X) = c\).

Hint: Show that \(E(X - c)\) equals zero by writing \(E(X - c)\) as the sum of two
integrals: one from \(-\infty\) to \(c\) and the other from \(c\) to \(\infty\). In the first, let \(y = c - x\);
and, in the second, \(z = x - c\). Finally, use the symmetry condition \(f(c - y) = f(c + y)\)
in the first.


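1.9.6. Let the random variable $X$ have mean $\mu$, standard deviation $\sigma$, and mgf
$M(t)$, $-h < t < h$. Show that

$$E\left(\frac{X-\mu}{\sigma}\right) = 0, \qquad E\left[\left(\frac{X-\mu}{\sigma}\right)^2\right] = 1,$$

and

$$E\left\{\exp\left[t\left(\frac{X-\mu}{\sigma}\right)\right]\right\} = e^{-\mu t/\sigma} M\left(\frac{t}{\sigma}\right), \quad -h\sigma < t < h\sigma.$$


1.9.7. Show that the moment generating function of the random variable $X$ having
the pdf $f(x) = \frac{1}{3}$, $-1 < x < 2$, zero elsewhere, is

$$M(t) = \begin{cases} \dfrac{e^{2t} - e^{-t}}{3t} & t \ne 0 \\ 1 & t = 0. \end{cases}$$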


1.9.8. Let $X$ be a random variable such that $E[(X - b)^2]$ exists for all real $b$. Show
that $E[(X - b)^2]$ is a minimum when $b = E(X)$.


1.9.9. Let $X$ be a random variable of the continuous type that has pdf $f(x)$. If $m$
is the unique median of the distribution of $X$ and $b$ is a real constant, show that

$$E(|X - b|) = E(|X - m|) + 2\int_m^b (b - x) f(x)\,dx,$$

provided that the expectations exist. For what value of $b$ is $E(|X - b|)$ a minimum?


1.9.10. Let $X$ denote a random variable for which $E[(X - a)^2]$ exists. Give an
example of a distribution of a discrete type such that this expectation is zero. Such
a distribution is called a degenerate distribution.


1.9.11. Let $X$ denote a random variable such that $K(t) = E(t^X)$ exists for all
real values of $t$ in a certain open interval that includes the point $t = 1$. Show that
$K^{(m)}(1)$ is equal to the $m$th factorial moment $E[X(X-1)\cdots(X-m+1)]$.


1.9.12. Let $X$ be a random variable. If $m$ is a positive integer, the expectation
$E[(X - b)^m]$, if it exists, is called the $m$th moment of the distribution about the
point $b$. Let the first, second, and third moments of the distribution about the point
7 be 3, 11, and 15, respectively. Determine the mean $\mu$ of $X$, and then find the
first, second, and third moments of the distribution about the point $\mu$.


1.9.13. Let $X$ be a random variable such that $R(t) = E(e^{t(X-b)})$ exists for $t$ such
that $-h < t < h$. If $m$ is a positive integer, show that $R^{(m)}(0)$ is equal to the $m$th
moment of the distribution about the point $b$.


1.9.14. Let $X$ be a random variable with mean $\mu$ and variance $\sigma^2$ such that the
third moment $E[(X - \mu)^3]$ about the vertical line through $\mu$ exists. The value of
the ratio $E[(X - \mu)^3]/\sigma^3$ is often used as a measure of skewness. Graph each of
the following probability density functions and show that this measure is negative,
zero, and positive for these respective distributions (which are said to be skewed to
the left, not skewed, and skewed to the right, respectively).

(a) $f(x) = (x+1)/2$, $-1 < x < 1$, zero elsewhere.

(b) $f(x) = \frac{1}{2}$, $-1 < x < 1$, zero elsewhere.

(c) $f(x) = (1-x)/2$, $-1 < x < 1$, zero elsewhere.


1.9.15. Let $X$ be a random variable with mean $\mu$ and variance $\sigma^2$ such that the
fourth moment $E[(X - \mu)^4]$ exists. The value of the ratio $E[(X - \mu)^4]/\sigma^4$ is often
used as a measure of kurtosis. Graph each of the following probability density
functions and show that this measure is smaller for the first distribution.

(a) $f(x) = \frac{1}{2}$, $-1 < x < 1$, zero elsewhere.

(b) $f(x) = 3(1 - x^2)/4$, $-1 < x < 1$, zero elsewhere.

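1.9.16. Let the random variable $X$ have pmf

$$p(x) = \begin{cases} p & x = -1, 1 \\ 1 - 2p & x = 0 \\ 0 & \text{elsewhere,} \end{cases}$$

where $0 < p < \frac{1}{2}$. Find the measure of kurtosis as a function of $p$. Determine its
value when $p = \frac{1}{3}$, $p = \frac{1}{5}$, $p = \frac{1}{10}$, and $p = \frac{1}{100}$. Note that the kurtosis
increases as $p$ decreases.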


1.9.17. Let $\psi(t) = \log M(t)$, where $M(t)$ is the mgf of a distribution. Prove that
$\psi'(0) = \mu$ and $\psi''(0) = \sigma^2$. The function $\psi(t)$ is called the cumulant generating
function.


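1.9.18. Find the mean and the variance of the distribution that has the cdf

$$F(x) = \begin{cases} 0 & x < 0 \\ \dfrac{x}{8} & 0 \le x < 2 \\ \dfrac{x^2}{16} & 2 \le x < 4 \\ 1 & 4 \le x. \end{cases}$$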


1.9.19. Find the moments of the distribution that has mgf $M(t) = (1-t)^{-3}$, $t < 1$.
Hint: Find the Maclaurin series for $M(t)$.


1.9.20. We say that $X$ has a Laplace distribution if its pdf is

$$f(t) = \frac{1}{2} e^{-|t|}, \quad -\infty < t < \infty. \qquad (1.9.5)$$

(a) Show that the mgf of $X$ is $M(t) = (1 - t^2)^{-1}$ for $|t| < 1$.

(b) Expand $M(t)$ into a Maclaurin series and use it to find all the moments of $X$.


1.9.21. Let $X$ be a random variable of the continuous type with pdf $f(x)$, which
is positive provided $0 < x < b < \infty$, and is equal to zero elsewhere. Show that

$$E(X) = \int_0^b [1 - F(x)]\,dx,$$

where $F(x)$ is the cdf of $X$.

1.9.22. Let $X$ be a random variable of the discrete type with pmf $p(x)$ that is
positive on the nonnegative integers and is equal to zero elsewhere. Show that

$$E(X) = \sum_{x=0}^{\infty} [1 - F(x)],$$

where $F(x)$ is the cdf of $X$.

1.9.23. Let $X$ have the pmf $p(x) = 1/k$, $x = 1, 2, 3, \ldots, k$, zero elsewhere. Show
that the mgf is

$$M(t) = \begin{cases} \dfrac{e^t(1 - e^{kt})}{k(1 - e^t)} & t \ne 0 \\ 1 & t = 0. \end{cases}$$


1.9.24. Let $X$ have the cdf $F(x)$ that is a mixture of the continuous and discrete
types, namely

$$F(x) = \begin{cases} 0 & x < 0 \\ \dfrac{x+1}{4} & 0 \le x < 1 \\ 1 & 1 \le x. \end{cases}$$

Determine reasonable definitions of $\mu = E(X)$ and $\sigma^2 = \text{var}(X)$ and compute each.
Hint: Determine the parts of the pmf and the pdf associated with each of the
discrete and continuous parts, and then sum for the discrete part and integrate for
the continuous part.


1.9.25. Consider $k$ continuous-type distributions with the following characteristics:
pdf $f_i(x)$, mean $\mu_i$, and variance $\sigma_i^2$, $i = 1, 2, \ldots, k$. If $c_i \ge 0$, $i = 1, 2, \ldots, k$, and
$c_1 + c_2 + \cdots + c_k = 1$, show that the mean and the variance of the distribution having
pdf $c_1 f_1(x) + \cdots + c_k f_k(x)$ are $\mu = \sum_{i=1}^k c_i\mu_i$ and $\sigma^2 = \sum_{i=1}^k c_i[\sigma_i^2 + (\mu_i - \mu)^2]$,
respectively.

1.9.26. Let $X$ be a random variable with a pdf $f(x)$ and mgf $M(t)$. Suppose $f$ is
symmetric about 0; i.e., $f(-x) = f(x)$. Show that $M(-t) = M(t)$.


1.9.27. Let $X$ have the exponential pdf, $f(x) = \beta^{-1}\exp\{-x/\beta\}$, $0 < x < \infty$, zero
elsewhere. Find the mgf, the mean, and the variance of $X$.


1.10 Important Inequalities 


In this section, we discuss some famous inequalities involving expectations. We 
make use of these inequalities in the remainder of the text. We begin with a useful 
result. 


Theorem 1.10.1. Let $X$ be a random variable and let $m$ be a positive integer.
Suppose $E[X^m]$ exists. If $k$ is a positive integer and $k \le m$, then $E[X^k]$ exists.


Proof: We prove it for the continuous case; but the proof is similar for the discrete
case if we replace integrals by sums. Let $f(x)$ be the pdf of $X$. Then

$$\begin{aligned}
\int_{-\infty}^{\infty} |x|^k f(x)\,dx
  &= \int_{|x| \le 1} |x|^k f(x)\,dx + \int_{|x| > 1} |x|^k f(x)\,dx \\
  &\le \int_{|x| \le 1} f(x)\,dx + \int_{|x| > 1} |x|^m f(x)\,dx \\
  &\le \int_{-\infty}^{\infty} f(x)\,dx + \int_{-\infty}^{\infty} |x|^m f(x)\,dx \\
  &\le 1 + E[|X|^m] < \infty, \qquad (1.10.1)
\end{aligned}$$

which is the desired result. ■


Theorem 1.10.2 (Markov's Inequality). Let $u(X)$ be a nonnegative function of the
random variable $X$. If $E[u(X)]$ exists, then for every positive constant $c$,

$$P[u(X) \ge c] \le \frac{E[u(X)]}{c}.$$

Proof. The proof is given when the random variable $X$ is of the continuous type;
but the proof can be adapted to the discrete case if we replace integrals by sums.
Let $A = \{x : u(x) \ge c\}$ and let $f(x)$ denote the pdf of $X$. Then

$$E[u(X)] = \int_{-\infty}^{\infty} u(x) f(x)\,dx = \int_A u(x) f(x)\,dx + \int_{A^c} u(x) f(x)\,dx.$$

Since each of the integrals in the extreme right-hand member of the preceding
equation is nonnegative, the left-hand member is greater than or equal to either of
them. In particular,

$$E[u(X)] \ge \int_A u(x) f(x)\,dx.$$

However, if $x \in A$, then $u(x) \ge c$; accordingly, the right-hand member of the
preceding inequality is not increased if we replace $u(x)$ by $c$. Thus

$$E[u(X)] \ge c \int_A f(x)\,dx.$$

Since

$$\int_A f(x)\,dx = P(X \in A) = P[u(X) \ge c],$$

it follows that

$$E[u(X)] \ge c\,P[u(X) \ge c],$$

which is the desired result. ■
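As a quick numerical illustration of Markov's inequality, the R sketch below takes $u(X) = X$ with $X$ exponential with mean 1. This distribution and the grid of constants $c$ are assumptions chosen only for the example, not part of the theorem.

```r
# Markov's inequality with u(X) = X, where X is exponential with mean 1
# (an assumed example distribution).
cvals <- c(0.5, 1, 2, 5, 10)
exact <- pexp(cvals, rate = 1, lower.tail = FALSE)  # P(X >= c) = exp(-c)
bound <- 1 / cvals                                   # E(X)/c with E(X) = 1
print(data.frame(c = cvals, exact = exact, bound = bound))
```

For small $c$ the bound exceeds 1 and is uninformative; the inequality is still satisfied in every row.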


The preceding theorem is a generalization of an inequality that is often called 
Chebyshev’s Inequality. This inequality we now establish. 


Theorem 1.10.3 (Chebyshev's Inequality). Let $X$ be a random variable with finite
variance $\sigma^2$ (by Theorem 1.10.1, this implies that the mean $\mu = E(X)$ exists). Then
for every $k > 0$,

$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}, \qquad (1.10.2)$$

or, equivalently,

$$P(|X - \mu| < k\sigma) \ge 1 - \frac{1}{k^2}.$$

Proof. In Theorem 1.10.2 take $u(X) = (X - \mu)^2$ and $c = k^2\sigma^2$. Then we have

$$P[(X - \mu)^2 \ge k^2\sigma^2] \le \frac{E[(X - \mu)^2]}{k^2\sigma^2}.$$

Since the numerator of the right-hand member of the preceding inequality is $\sigma^2$, the
inequality may be written

$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2},$$

which is the desired result. Naturally, we would take the positive number $k$ to be
greater than 1 to have an inequality of interest. ■


Hence, the number $1/k^2$ is an upper bound for the probability $P(|X - \mu| \ge k\sigma)$.
In the following example this upper bound and the exact value of the probability
are compared in special instances.


Example 1.10.1. Let $X$ have the uniform pdf

$$f(x) = \begin{cases} \dfrac{1}{2\sqrt{3}} & -\sqrt{3} < x < \sqrt{3} \\ 0 & \text{elsewhere.} \end{cases}$$

Based on Example 1.9.1, for this uniform distribution, we have $\mu = 0$ and $\sigma^2 = 1$.
If $k = \frac{3}{2}$, we have the exact probability

$$P(|X - \mu| \ge k\sigma) = P\left(|X| \ge \tfrac{3}{2}\right) = 1 - \int_{-3/2}^{3/2} \frac{1}{2\sqrt{3}}\,dx = 1 - \frac{\sqrt{3}}{2}.$$

By Chebyshev's inequality, this probability has the upper bound $1/k^2 = \frac{4}{9}$. Since
$1 - \sqrt{3}/2 = 0.134$, approximately, the exact probability in this case is considerably
less than the upper bound $\frac{4}{9}$. If we take $k = 2$, we have the exact probability
$P(|X - \mu| \ge 2\sigma) = P(|X| \ge 2) = 0$. This again is considerably less than the upper
bound $1/k^2 = \frac{1}{4}$ provided by Chebyshev's inequality. ■
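The two comparisons in this example can be reproduced in R with the uniform cdf `punif`. This is a minimal sketch; the variable names are illustrative.

```r
# Exact tail probability versus Chebyshev's bound for X uniform on
# (-sqrt(3), sqrt(3)), for which mu = 0 and sigma = 1.
a <- sqrt(3)
k <- c(1.5, 2)
exact <- 1 - (punif(k, -a, a) - punif(-k, -a, a))  # P(|X| >= k)
bound <- 1 / k^2                                   # Chebyshev upper bound
print(data.frame(k = k, exact = round(exact, 4), bound = round(bound, 4)))
```

The first row corresponds to $k = 3/2$, with exact value $1 - \sqrt{3}/2$, and the second to $k = 2$, where the exact probability is 0.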


In each of the instances in Example 1.10.1, the probability $P(|X - \mu| \ge k\sigma)$ and
its upper bound $1/k^2$ differ considerably. This suggests that this inequality might
be made sharper. However, if we want an inequality that holds for every $k > 0$
and holds for all random variables having a finite variance, such an improvement is
impossible, as is shown by the following example.


Example 1.10.2. Let the random variable $X$ of the discrete type have probabilities
$\frac{1}{8}, \frac{6}{8}, \frac{1}{8}$ at the points $x = -1, 0, 1$, respectively. Here $\mu = 0$ and $\sigma^2 = \frac{1}{4}$. If $k = 2$,
then $1/k^2 = \frac{1}{4}$ and $P(|X - \mu| \ge k\sigma) = P(|X| \ge 1) = \frac{1}{4}$. That is, the probability
$P(|X - \mu| \ge k\sigma)$ here attains the upper bound $1/k^2 = \frac{1}{4}$. Hence the inequality
cannot be improved without further assumptions about the distribution of $X$. ■
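A few lines of R confirm that this three-point distribution attains the Chebyshev bound when $k = 2$. The sketch assumes the probabilities 1/8, 6/8, 1/8 at $-1, 0, 1$.

```r
# Three-point distribution for which Chebyshev's bound is attained at k = 2.
x <- c(-1, 0, 1)
p <- c(1, 6, 1) / 8
mu <- sum(x * p)                      # mean
sigma <- sqrt(sum((x - mu)^2 * p))    # standard deviation
k <- 2
exact <- sum(p[abs(x - mu) >= k * sigma])  # P(|X - mu| >= k * sigma)
print(c(mu = mu, sigma2 = sigma^2, exact = exact, bound = 1 / k^2))
```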


A convenient form of Chebyshev's Inequality is found by taking $k\sigma = \epsilon$ for $\epsilon > 0$.
Then Equation (1.10.2) becomes

$$P(|X - \mu| \ge \epsilon) \le \frac{\sigma^2}{\epsilon^2}, \quad \text{for all } \epsilon > 0. \qquad (1.10.3)$$


The second inequality of this section involves convex functions.

Definition 1.10.1. A function $\phi$ defined on an interval $(a, b)$, $-\infty \le a < b \le \infty$,
is said to be a convex function if for all $x, y$ in $(a, b)$ and for all $0 < \gamma < 1$,

$$\phi[\gamma x + (1 - \gamma)y] \le \gamma\phi(x) + (1 - \gamma)\phi(y). \qquad (1.10.4)$$

We say $\phi$ is strictly convex if the above inequality is strict.


Depending on the existence of first or second derivatives of $\phi$, the following
theorem can be proved.


Theorem 1.10.4. If $\phi$ is differentiable on $(a, b)$, then

(a) $\phi$ is convex if and only if $\phi'(x) \le \phi'(y)$, for all $a < x < y < b$,

(b) $\phi$ is strictly convex if and only if $\phi'(x) < \phi'(y)$, for all $a < x < y < b$.

If $\phi$ is twice differentiable on $(a, b)$, then

(a) $\phi$ is convex if and only if $\phi''(x) \ge 0$, for all $a < x < b$,

(b) $\phi$ is strictly convex if $\phi''(x) > 0$, for all $a < x < b$.


Of course, the second part of this theorem follows immediately from the first 
part. While the first part appeals to one’s intuition, the proof of it can be found in 
most analysis books; see, for instance, Hewitt and Stromberg (1965). A very useful 
probability inequality follows from convexity. 


Theorem 1.10.5 (Jensen's Inequality). If $\phi$ is convex on an open interval $I$ and
$X$ is a random variable whose support is contained in $I$ and has finite expectation,
then

$$\phi[E(X)] \le E[\phi(X)]. \qquad (1.10.5)$$

If $\phi$ is strictly convex, then the inequality is strict unless $X$ is a constant random
variable.


Proof: For our proof we assume that $\phi$ has a second derivative, but in general only
convexity is required. Expand $\phi(x)$ into a Taylor series about $\mu = E[X]$ of order 2:⁹

$$\phi(x) = \phi(\mu) + \phi'(\mu)(x - \mu) + \frac{\phi''(\zeta)(x - \mu)^2}{2},$$

where $\zeta$ is between $x$ and $\mu$. Because the last term on the right side of the above
equation is nonnegative, we have

$$\phi(x) \ge \phi(\mu) + \phi'(\mu)(x - \mu).$$

Taking expectations of both sides leads to the result. The inequality is strict if
$\phi''(x) > 0$, for all $x \in (a, b)$, provided $X$ is not a constant. ■


9See, for example, the discussion on Taylor series in Mathematical Comments referenced in the
Preface.

Example 1.10.3. Let $X$ be a nondegenerate random variable with mean $\mu$ and a
finite second moment. Then $\mu^2 < E(X^2)$. This is obtained by Jensen's inequality
using the strictly convex function $\phi(t) = t^2$. ■


The last inequality concerns different means of finite sets of positive numbers. 


Example 1.10.4 (Harmonic and Geometric Means). Let $\{a_1, \ldots, a_n\}$ be a set of
positive numbers. Create a distribution for a random variable $X$ by placing weight
$1/n$ on each of the numbers $a_1, \ldots, a_n$. Then the mean of $X$ is the arithmetic
mean (AM), $E(X) = n^{-1}\sum_{i=1}^n a_i$. Then, since $-\log x$ is a convex function, we
have by Jensen's inequality that

$$-\log\left(\frac{1}{n}\sum_{i=1}^n a_i\right) \le E(-\log X) = -\frac{1}{n}\sum_{i=1}^n \log a_i = -\log(a_1 a_2 \cdots a_n)^{1/n}$$

or, equivalently,

$$\log\left(\frac{1}{n}\sum_{i=1}^n a_i\right) \ge \log(a_1 a_2 \cdots a_n)^{1/n},$$

and, hence,

$$(a_1 a_2 \cdots a_n)^{1/n} \le \frac{1}{n}\sum_{i=1}^n a_i. \qquad (1.10.6)$$

The quantity on the left side of this inequality is called the geometric mean (GM).
So (1.10.6) is equivalent to saying that GM $\le$ AM for any finite set of positive
numbers.

Now in (1.10.6) replace $a_i$ by $1/a_i$ (which is also positive). We then obtain

$$\frac{1}{n}\sum_{i=1}^n \frac{1}{a_i} \ge \left(\frac{1}{a_1}\frac{1}{a_2}\cdots\frac{1}{a_n}\right)^{1/n}$$

or, equivalently,

$$\frac{1}{\frac{1}{n}\sum_{i=1}^n \frac{1}{a_i}} \le (a_1 a_2 \cdots a_n)^{1/n}. \qquad (1.10.7)$$

The left member of this inequality is called the harmonic mean (HM). Putting
(1.10.6) and (1.10.7) together, we have shown the relationship

$$\text{HM} \le \text{GM} \le \text{AM}, \qquad (1.10.8)$$

for any finite set of positive numbers.
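The ordering of the three means is easy to check numerically. In the R sketch below, `three_means` is a hypothetical helper written for this illustration (it is not a function from the text), and the input vector is an arbitrary choice.

```r
# Hypothetical helper: harmonic, geometric, and arithmetic means of a
# vector of positive numbers.
three_means <- function(a) {
  stopifnot(is.numeric(a), length(a) > 0, all(a > 0))
  c(HM = 1 / mean(1 / a),        # harmonic mean
    GM = exp(mean(log(a))),      # geometric mean, computed on the log scale
    AM = mean(a))                # arithmetic mean
}

m <- three_means(c(1, 2, 4))
print(m)
```

Computing the geometric mean as `exp(mean(log(a)))` follows the same logarithmic step used in the derivation of (1.10.6) and avoids overflow in the product for long vectors.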


EXERCISES 


1.10.1. Let $X$ be a random variable with mean $\mu$ and let $E[(X - \mu)^{2k}]$ exist.
Show, with $d > 0$, that $P(|X - \mu| \ge d) \le E[(X - \mu)^{2k}]/d^{2k}$. This is essentially
Chebyshev's inequality when $k = 1$. The fact that this holds for all $k = 1, 2, 3, \ldots$,
when those $(2k)$th moments exist, usually provides a much smaller upper bound for
$P(|X - \mu| \ge d)$ than does Chebyshev's result.




1.10.2. Let $X$ be a random variable such that $P(X \le 0) = 0$ and let $\mu = E(X)$
exist. Show that $P(X \ge 2\mu) \le \frac{1}{2}$.


1.10.3. If $X$ is a random variable such that $E(X) = 3$ and $E(X^2) = 13$, use
Chebyshev's inequality to determine a lower bound for the probability
$P(-2 < X < 8)$.


1.10.4. Suppose X has a Laplace distribution with pdf (1.9.20). Show that the 
mean and variance of X are 0 and 2, respectively. Using Chebyshev’s inequality 
determine the upper bound for P(|X| > 5) and then compare it with the exact 
probability. 


1.10.5. Let $X$ be a random variable with mgf $M(t)$, $-h < t < h$. Prove that

$$P(X \ge a) \le e^{-at}M(t), \quad 0 < t < h,$$

and that

$$P(X \le a) \le e^{-at}M(t), \quad -h < t < 0.$$

Hint: Let $u(x) = e^{tx}$ and $c = e^{ta}$ in Theorem 1.10.2. Note: These results imply
that $P(X \ge a)$ and $P(X \le a)$ are less than or equal to their respective least upper
bounds for $e^{-at}M(t)$ when $0 < t < h$ and when $-h < t < 0$.


1.10.6. The mgf of $X$ exists for all real values of $t$ and is given by

$$M(t) = \frac{e^t - e^{-t}}{2t}, \quad t \ne 0, \qquad M(0) = 1.$$

Use the results of the preceding exercise to show that $P(X \ge 1) = 0$ and
$P(X \le -1) = 0$. Note that here $h$ is infinite.


1.10.7. Let $X$ be a positive random variable; i.e., $P(X \le 0) = 0$. Argue that

(a) $E(1/X) \ge 1/E(X)$

(b) $E[-\log X] \ge -\log[E(X)]$

(c) $E[\log(1/X)] \ge \log[1/E(X)]$

(d) $E[X^3] \ge [E(X)]^3$.


Chapter 2


Multivariate Distributions 


2.1 Distributions of Two Random Variables 


We begin the discussion of a pair of random variables with the following example. A
coin is tossed three times and our interest is in the ordered number pair (number of
H's on first two tosses, number of H's on all three tosses), where H and T represent,
respectively, heads and tails. Let $\mathcal{C}$ = {TTT, TTH, THT, HTT, THH, HTH, HHT,
HHH} denote the sample space. Let $X_1$ denote the number of H's on the first
two tosses and $X_2$ denote the number of H's on all three flips. Then our interest
can be represented by the pair of random variables $(X_1, X_2)$. For example,
$(X_1(HTH), X_2(HTH))$ represents the outcome (1, 2). Continuing in this way, $X_1$
and $X_2$ are real-valued functions defined on the sample space $\mathcal{C}$, which take us from
the sample space to the space of ordered number pairs

$$\mathcal{D} = \{(0,0), (0,1), (1,1), (1,2), (2,2), (2,3)\}.$$

Thus $X_1$ and $X_2$ are two random variables defined on the space $\mathcal{C}$, and, in this
example, the space of these random variables is the two-dimensional set $\mathcal{D}$, which is
a subset of two-dimensional Euclidean space $R^2$. Hence $(X_1, X_2)$ is a vector function
from $\mathcal{C}$ to $\mathcal{D}$. We now formulate the definition of a random vector.


Definition 2.1.1 (Random Vector). Given a random experiment with a sample
space $\mathcal{C}$, consider two random variables $X_1$ and $X_2$, which assign to each element
$c$ of $\mathcal{C}$ one and only one ordered pair of numbers $X_1(c) = x_1$, $X_2(c) = x_2$. Then
we say that $(X_1, X_2)$ is a random vector. The space of $(X_1, X_2)$ is the set of
ordered pairs $\mathcal{D} = \{(x_1, x_2) : x_1 = X_1(c),\ x_2 = X_2(c),\ c \in \mathcal{C}\}$.


We often denote random vectors using vector notation $\mathbf{X} = (X_1, X_2)'$, where
the $'$ denotes the transpose of the row vector $(X_1, X_2)$. Also, we often use $(X, Y)$
to denote random vectors.

Let $\mathcal{D}$ be the space associated with the random vector $(X_1, X_2)$. Let $A$ be a
subset of $\mathcal{D}$. As in the case of one random variable, we speak of the event $A$. We
wish to define the probability of the event $A$, which we denote by $P_{X_1,X_2}[A]$. As
with random variables in Section 1.5 we can uniquely define $P_{X_1,X_2}$ in terms of the
cumulative distribution function (cdf), which is given by

$$F_{X_1,X_2}(x_1, x_2) = P[\{X_1 \le x_1\} \cap \{X_2 \le x_2\}], \qquad (2.1.1)$$

for all $(x_1, x_2) \in R^2$. Because $X_1$ and $X_2$ are random variables, each of the events
in the above intersection and the intersection of the events are events in the original
sample space $\mathcal{C}$. Thus the expression is well defined. As with random variables, we
write $P[\{X_1 \le x_1\} \cap \{X_2 \le x_2\}]$ as $P[X_1 \le x_1, X_2 \le x_2]$. As Exercise 2.1.3 shows,

$$\begin{aligned}
P[a_1 < X_1 \le b_1,\ a_2 < X_2 \le b_2] = {} & F_{X_1,X_2}(b_1, b_2) - F_{X_1,X_2}(a_1, b_2) \\
 & - F_{X_1,X_2}(b_1, a_2) + F_{X_1,X_2}(a_1, a_2). \qquad (2.1.2)
\end{aligned}$$

Hence, all induced probabilities of sets of the form $(a_1, b_1] \times (a_2, b_2]$ can be formulated
in terms of the cdf. We often call this cdf the joint cumulative distribution
function of $(X_1, X_2)$.

As with random variables, we are mainly concerned with two types of random 
vectors, namely discrete and continuous. We first discuss the discrete type. 

A random vector $(X_1, X_2)$ is a discrete random vector if its space $\mathcal{D}$ is finite
or countable. Hence, $X_1$ and $X_2$ are both discrete also. The joint probability
mass function (pmf) of $(X_1, X_2)$ is defined by

$$p_{X_1,X_2}(x_1, x_2) = P[X_1 = x_1, X_2 = x_2], \qquad (2.1.3)$$

for all $(x_1, x_2) \in \mathcal{D}$. As with random variables, the pmf uniquely defines the cdf. It
also is characterized by the two properties

$$\text{(i) } 0 \le p_{X_1,X_2}(x_1, x_2) \le 1 \quad \text{and} \quad \text{(ii) } \sum\sum_{\mathcal{D}} p_{X_1,X_2}(x_1, x_2) = 1. \qquad (2.1.4)$$

For an event $B \in \mathcal{D}$, we have

$$P[(X_1, X_2) \in B] = \sum\sum_{B} p_{X_1,X_2}(x_1, x_2).$$


Example 2.1.1. Consider the example at the beginning of this section where a fair
coin is flipped three times and $X_1$ and $X_2$ are the number of heads on the first two
flips and all 3 flips, respectively. We can conveniently table the pmf of $(X_1, X_2)$ as

                          Support of X2
                        0      1      2      3
                  0    1/8    1/8     0      0
Support of X1     1     0     2/8    2/8     0
                  2     0      0     1/8    1/8

For instance, $P(X_1 \ge 2, X_2 \ge 2) = p(2,2) + p(2,3) = 2/8$. ■

At times it is convenient to speak of the support of a discrete random vector
$(X_1, X_2)$. These are all the points $(x_1, x_2)$ in the space of $(X_1, X_2)$ such
that $p(x_1, x_2) > 0$. In the last example the support consists of the six points
{(0,0), (0,1), (1,1), (1,2), (2,2), (2,3)}.
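The joint pmf of Example 2.1.1 can be obtained by brute-force enumeration of the eight equally likely outcomes. The R sketch below is one way to do this; coding heads as 1 and tails as 0 is an assumption made for convenience.

```r
# Enumerate the 8 equally likely outcomes of three tosses (1 = H, 0 = T).
tosses <- expand.grid(t1 = 0:1, t2 = 0:1, t3 = 0:1)
x1 <- tosses$t1 + tosses$t2   # number of heads on the first two tosses
x2 <- x1 + tosses$t3          # number of heads on all three tosses

# Joint pmf as a 3 x 4 table: rows index x1 = 0,1,2; columns index x2 = 0,...,3.
joint <- table(factor(x1, levels = 0:2), factor(x2, levels = 0:3)) / 8
print(joint)

# P(X1 >= 2, X2 >= 2) = p(2,2) + p(2,3)
prob <- sum(joint["2", c("2", "3")])
print(prob)
```

The nonzero cells of `joint` are exactly the six support points listed above.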

We say a random vector $(X_1, X_2)$ with space $\mathcal{D}$ is of the continuous type if its
cdf $F_{X_1,X_2}(x_1, x_2)$ is continuous. For the most part, the continuous random vectors
in this book have cdfs that can be represented as integrals of nonnegative functions.
That is, $F_{X_1,X_2}(x_1, x_2)$ can be expressed as

$$F_{X_1,X_2}(x_1, x_2) = \int_{-\infty}^{x_1}\int_{-\infty}^{x_2} f_{X_1,X_2}(w_1, w_2)\,dw_2\,dw_1, \qquad (2.1.5)$$

for all $(x_1, x_2) \in R^2$. We call the integrand the joint probability density
function (pdf) of $(X_1, X_2)$. Then

$$\frac{\partial^2 F_{X_1,X_2}(x_1, x_2)}{\partial x_1\,\partial x_2} = f_{X_1,X_2}(x_1, x_2),$$

except possibly on events that have probability zero. A pdf is essentially
characterized by the two properties

$$\text{(i) } f_{X_1,X_2}(x_1, x_2) \ge 0 \quad \text{and} \quad \text{(ii) } \int\int_{\mathcal{D}} f_{X_1,X_2}(x_1, x_2)\,dx_1\,dx_2 = 1. \qquad (2.1.6)$$

For the reader's benefit, Section 4.2 of the accompanying resource Mathematical
Comments¹ offers a short review of double integration. For an event $A \in \mathcal{D}$, we
have

$$P[(X_1, X_2) \in A] = \int\int_A f_{X_1,X_2}(x_1, x_2)\,dx_1\,dx_2.$$

Note that the $P[(X_1, X_2) \in A]$ is just the volume under the surface $z = f_{X_1,X_2}(x_1, x_2)$
over the set $A$.


Remark 2.1.1. As with univariate random variables, we often drop the subscript
$(X_1, X_2)$ from joint cdfs, pdfs, and pmfs, when it is clear from the context. We also
use notation such as $f_{12}$ instead of $f_{X_1,X_2}$. Besides $(X_1, X_2)$, we often use $(X, Y)$
to express random vectors.


We next present two examples of jointly continuous random variables. 


Example 2.1.2. Consider a continuous random vector $(X, Y)$ which is uniformly
distributed over the unit circle in $R^2$. Since the area of the unit circle is $\pi$, the joint
pdf is

$$f(x, y) = \begin{cases} \dfrac{1}{\pi} & -1 < y < 1,\ -\sqrt{1 - y^2} < x < \sqrt{1 - y^2} \\ 0 & \text{elsewhere.} \end{cases}$$

Probabilities of certain events follow immediately from geometry. For instance, let
$A$ be the interior of the circle with radius 1/2. Then $P[(X, Y) \in A] = \pi(1/2)^2/\pi = 1/4$.
Next, let $B$ be the ring formed by the concentric circles with the respective
radii of 1/2 and $\sqrt{2}/2$. Then $P[(X, Y) \in B] = \pi[(\sqrt{2}/2)^2 - (1/2)^2]/\pi = 1/4$. The
regions $A$ and $B$ have the same area and hence for this uniform pdf are equilikely.
■


1Downloadable at the site listed in the Preface.

In the next example, we use the general fact that double integrals can be
expressed as iterated univariate integrals. Thus double integrations can be carried
out using iterated univariate integrations. This is discussed in some detail with
examples in Section 4.2 of the accompanying resource Mathematical Comments.²
The aid of a simple sketch of the region of integration is valuable in setting up the
upper and lower limits of integration for each of the iterated integrals.


Example 2.1.3. Suppose an electrical component has two batteries. Let $X$ and $Y$
denote the lifetimes in standard units of the respective batteries. Assume that the
pdf of $(X, Y)$ is

$$f(x, y) = \begin{cases} 4xy\,e^{-(x^2 + y^2)} & x > 0,\ y > 0 \\ 0 & \text{elsewhere.} \end{cases}$$

The surface $z = f(x, y)$ is sketched in Figure 2.1.1 where the grid squares are
0.1 by 0.1. From the figure, the pdf peaks at about $(x, y) = (0.7, 0.7)$. Solving
the equations $\partial f/\partial x = 0$ and $\partial f/\partial y = 0$ simultaneously shows that actually the
maximum of $f(x, y)$ occurs at $(x, y) = (\sqrt{2}/2, \sqrt{2}/2)$. The batteries are more likely
to die in regions near the peak. The surface tapers to 0 as $x$ and $y$ get large in any
direction. For instance, the probability that both batteries survive beyond $\sqrt{2}/2$
units is given by

$$\begin{aligned}
P\left(X > \tfrac{\sqrt{2}}{2},\ Y > \tfrac{\sqrt{2}}{2}\right)
  &= \int_{\sqrt{2}/2}^{\infty}\int_{\sqrt{2}/2}^{\infty} 4xy\,e^{-(x^2 + y^2)}\,dx\,dy \\
  &= \int_{\sqrt{2}/2}^{\infty} 2x e^{-x^2}\left[\int_{\sqrt{2}/2}^{\infty} 2y e^{-y^2}\,dy\right] dx \\
  &= \int_{1/2}^{\infty} e^{-z}\,dz \int_{1/2}^{\infty} e^{-w}\,dw = \left(e^{-1/2}\right)^2 \approx 0.3679,
\end{aligned}$$

where we made use of the change-in-variables $z = x^2$ and $w = y^2$. In contrast to the
last example, consider the regions $A = \{(x, y) : |x - (1/2)| < 0.3,\ |y - (1/2)| < 0.3\}$
and $B = \{(x, y) : |x - 2| < 0.3,\ |y - 2| < 0.3\}$. The reader should locate these regions
on Figure 2.1.1. The areas of $A$ and $B$ are the same, but it is clear from the figure
that $P[(X, Y) \in A]$ is much larger than $P[(X, Y) \in B]$. Exercise 2.1.6 confirms this
by showing that $P[(X, Y) \in A] = 0.1879$ while $P[(X, Y) \in B] = 0.0026$. ■
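The three probabilities quoted in this example can be checked numerically. The R sketch below relies on the factorization $4xy\,e^{-(x^2+y^2)} = [2xe^{-x^2}][2ye^{-y^2}]$ used in the iterated integral, so that each probability over a rectangle is a product of two one-dimensional integrals. The helper names `f1` and `p_int` are illustrative, not from the text.

```r
# One-dimensional factor of the joint pdf 4xy exp{-(x^2 + y^2)}.
f1 <- function(x) 2 * x * exp(-x^2)

# Probability that one coordinate lies in (lo, hi), by numerical integration.
p_int <- function(lo, hi) integrate(f1, lo, hi)$value

p_both <- p_int(sqrt(2) / 2, Inf)^2  # both lifetimes exceed sqrt(2)/2
p_A <- p_int(0.2, 0.8)^2             # square of side 0.6 centered at (1/2, 1/2)
p_B <- p_int(1.7, 2.3)^2             # square of side 0.6 centered at (2, 2)
print(round(c(both = p_both, A = p_A, B = p_B), 4))
```

Up to numerical integration error, the printed values should agree with 0.3679, 0.1879, and 0.0026 given in the text.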


For a continuous random vector $(X_1, X_2)$, the support of $(X_1, X_2)$ contains all
points $(x_1, x_2)$ for which $f(x_1, x_2) > 0$. We denote the support of a random vector
by $\mathcal{S}$. As in the univariate case, $\mathcal{S} \subset \mathcal{D}$.

As in the last two examples, we extend the definition of a pdf $f_{X_1,X_2}(x_1, x_2)$
over $R^2$ by using zero elsewhere. We do this consistently so that tedious, repetitious
references to the space $\mathcal{D}$ can be avoided. Once this is done, we replace

$$\int\int_{\mathcal{D}} f_{X_1,X_2}(x_1, x_2)\,dx_1\,dx_2 \quad \text{by} \quad \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x_1, x_2)\,dx_1\,dx_2.$$
D —co J—oo 


2Downloadable at the site listed in the Preface.

Figure 2.1.1: A sketch of the surface of the joint pdf discussed in Example
2.1.3. On the figure, the origin is located at the intersection of the $x$ and $z$ axes
and the grid squares are 0.1 by 0.1, so points are easily located. As discussed in the
text, the peak of the pdf occurs at the point $(\sqrt{2}/2, \sqrt{2}/2)$.


Likewise we may extend the pmf $p_{X_1,X_2}(x_1, x_2)$ over a convenient set by using zero
elsewhere. Hence, we replace

$$\sum\sum_{\mathcal{D}} p_{X_1,X_2}(x_1, x_2) \quad \text{by} \quad \sum_{x_2}\sum_{x_1} p(x_1, x_2).$$


2.1.1 Marginal Distributions 


Let $(X_1, X_2)$ be a random vector. Then both $X_1$ and $X_2$ are random variables.
We can obtain their distributions in terms of the joint distribution of $(X_1, X_2)$ as
follows. Recall that the event which defined the cdf of $X_1$ at $x_1$ is $\{X_1 \le x_1\}$.
However,

$$\{X_1 \le x_1\} = \{X_1 \le x_1\} \cap \{-\infty < X_2 < \infty\} = \{X_1 \le x_1,\ -\infty < X_2 < \infty\}.$$

Taking probabilities, we have

$$F_{X_1}(x_1) = P[X_1 \le x_1,\ -\infty < X_2 < \infty], \qquad (2.1.7)$$

Table 2.1.1: Joint and Marginal Distributions for the discrete random vector
$(X_1, X_2)$ of Example 2.1.1.

                          Support of X2
                        0      1      2      3     p_X1(x1)
                  0    1/8    1/8     0      0       2/8
Support of X1     1     0     2/8    2/8     0       4/8
                  2     0      0     1/8    1/8      2/8
          p_X2(x2)     1/8    3/8    3/8    1/8


for all $x_1 \in R$. By Theorem 1.3.6 we can write this equation as $F_{X_1}(x_1) =
\lim_{x_2 \uparrow \infty} F(x_1, x_2)$. Thus we have a relationship between the cdfs, which we can
extend to either the pmf or pdf depending on whether $(X_1, X_2)$ is discrete or
continuous.

First consider the discrete case. Let $\mathcal{D}_{X_1}$ be the support of $X_1$. For $x_1 \in \mathcal{D}_{X_1}$,
Equation (2.1.7) is equivalent to

$$F_{X_1}(x_1) = \sum\sum_{w_1 \le x_1,\ -\infty < x_2 < \infty} p_{X_1,X_2}(w_1, x_2) = \sum_{w_1 \le x_1}\left\{\sum_{x_2 < \infty} p_{X_1,X_2}(w_1, x_2)\right\}.$$

By the uniqueness of cdfs, the quantity in braces must be the pmf of $X_1$ evaluated
at $w_1$; that is,

$$p_{X_1}(x_1) = \sum_{x_2 < \infty} p_{X_1,X_2}(x_1, x_2), \qquad (2.1.8)$$

for all $x_1 \in \mathcal{D}_{X_1}$. Hence, to find the probability that $X_1$ is $x_1$, keep $x_1$ fixed and
sum $p_{X_1,X_2}$ over all of $x_2$. In terms of a tabled joint pmf with rows comprised of
$X_1$ support values and columns comprised of $X_2$ support values, this says that the
distribution of $X_1$ can be obtained by the marginal sums of the rows. Likewise, the
pmf of $X_2$ can be obtained by marginal sums of the columns.

Consider the joint discrete distribution of the random vector $(X_1, X_2)$ as
presented in Example 2.1.1. In Table 2.1.1, we have added these marginal sums. The
final row of this table is the pmf of $X_2$, while the final column is the pmf of $X_1$.
In general, because these distributions are recorded in the margins of the table, we
often refer to them as marginal pmfs.
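In R, the marginal sums of a tabled joint pmf are row and column sums of a matrix. The sketch below enters the joint pmf of Example 2.1.1 by hand; the object names are illustrative.

```r
# Joint pmf of (X1, X2) from Example 2.1.1: rows are x1 = 0,1,2 and
# columns are x2 = 0,1,2,3.
joint <- matrix(c(1, 1, 0, 0,
                  0, 2, 2, 0,
                  0, 0, 1, 1) / 8,
                nrow = 3, byrow = TRUE,
                dimnames = list(x1 = as.character(0:2),
                                x2 = as.character(0:3)))

p1 <- rowSums(joint)  # marginal pmf of X1: 2/8, 4/8, 2/8
p2 <- colSums(joint)  # marginal pmf of X2: 1/8, 3/8, 3/8, 1/8
print(p1)
print(p2)
```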


Example 2.1.4. Consider a random experiment that consists of drawing at random
one chip from a bowl containing 10 chips of the same shape and size. Each chip has
an ordered pair of numbers on it: one with (1,1), one with (2,1), two with (3,1),
one with (1,2), two with (2,2), and three with (3,2). Let the random variables
$X_1$ and $X_2$ be defined as the respective first and second values of the ordered pair.
Thus the joint pmf $p(x_1, x_2)$ of $X_1$ and $X_2$ can be given by the following table, with
$p(x_1, x_2)$ equal to zero elsewhere.

                    x2
       x1        1       2      p1(x1)
        1      1/10    1/10      2/10
        2      1/10    2/10      3/10
        3      2/10    3/10      5/10
     p2(x2)    4/10    6/10

The joint probabilities have been summed in each row and each column and these
sums recorded in the margins to give the marginal probability mass functions of $X_1$
and $X_2$, respectively. Note that it is not necessary to have a formula for $p(x_1, x_2)$
to do this. ■


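We next consider the continuous case. Let $\mathcal{D}_{X_1}$ be the support of $X_1$. For
$x_1 \in \mathcal{D}_{X_1}$, Equation (2.1.7) is equivalent to

$$F_{X_1}(x_1) = \int_{-\infty}^{x_1}\int_{-\infty}^{\infty} f_{X_1,X_2}(w_1, x_2)\,dx_2\,dw_1 = \int_{-\infty}^{x_1}\left\{\int_{-\infty}^{\infty} f_{X_1,X_2}(w_1, x_2)\,dx_2\right\} dw_1.$$

By the uniqueness of cdfs, the quantity in braces must be the pdf of $X_1$, evaluated
at $w_1$; that is,

$$f_{X_1}(x_1) = \int_{-\infty}^{\infty} f_{X_1,X_2}(x_1, x_2)\,dx_2 \qquad (2.1.9)$$

for all $x_1 \in \mathcal{D}_{X_1}$. Hence, in the continuous case the marginal pdf of $X_1$ is found by
integrating out $x_2$. Similarly, the marginal pdf of $X_2$ is found by integrating out
$x_1$.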


Example 2.1.5 (Example 2.1.2, continued). Consider the vector of continuous
random variables $(X, Y)$ discussed in Example 2.1.2. The space of the random
vector is the unit circle with center at (0,0) as shown in Figure 2.1.2. To find the
marginal distribution of $X$, fix $x$ between $-1$ and 1 and then integrate out $y$ from
$-\sqrt{1 - x^2}$ to $\sqrt{1 - x^2}$ as the arrow shows on Figure 2.1.2. Hence, the marginal pdf
of $X$ is

$$f_X(x) = \int_{-\sqrt{1 - x^2}}^{\sqrt{1 - x^2}} \frac{1}{\pi}\,dy = \frac{2}{\pi}\sqrt{1 - x^2}, \quad -1 < x < 1.$$

Although $(X, Y)$ has a joint uniform distribution, the distribution of $X$ is unimodal
with peak at 0. This is not surprising. Since the joint distribution is uniform, from
Figure 2.1.2 $X$ is more likely to be near 0 than at either extreme $-1$ or 1. Because
the joint pdf is symmetric in $x$ and $y$, the marginal pdf of $Y$ is the same as that of
$X$. ■
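As a sanity check on this marginal pdf, the R sketch below integrates it numerically over its support and over the central interval $(-0.5, 0.5)$. The choice of central interval is only for illustration.

```r
# Marginal pdf of X when (X, Y) is uniform on the unit circle.
# pmax() guards against tiny negative values from rounding near |x| = 1.
fX <- function(x) (2 / pi) * sqrt(pmax(0, 1 - x^2))

total <- integrate(fX, -1, 1)$value        # should be close to 1
central <- integrate(fX, -0.5, 0.5)$value  # P(-0.5 < X < 0.5)
print(c(total = total, central = central))
```

The central probability exceeds 1/2, the value a uniform distribution on $(-1, 1)$ would give, which is consistent with the remark that $X$ is more likely to be near 0.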


Example 2.1.6. Let $X_1$ and $X_2$ have the joint pdf

$$f(x_1, x_2) = \begin{cases} x_1 + x_2 & 0 < x_1 < 1,\ 0 < x_2 < 1 \\ 0 & \text{elsewhere.} \end{cases}$$

Figure 2.1.2: Region of integration for Example 2.1.5. It depicts the integration
with respect to $y$ at a fixed but arbitrary $x$.


Notice the space of the random vector is the interior of the square with vertices
(0,0), (1,0), (1,1) and (0,1). The marginal pdf of $X_1$ is

$$f_1(x_1) = \int_0^1 (x_1 + x_2)\,dx_2 = x_1 + \tfrac{1}{2}, \quad 0 < x_1 < 1,$$

zero elsewhere, and the marginal pdf of $X_2$ is

$$f_2(x_2) = \int_0^1 (x_1 + x_2)\,dx_1 = \tfrac{1}{2} + x_2, \quad 0 < x_2 < 1,$$

zero elsewhere. A probability like $P(X_1 \le \frac{1}{2})$ can be computed from either $f_1(x_1)$
or $f(x_1, x_2)$ because

$$\int_0^{1/2}\int_0^1 f(x_1, x_2)\,dx_2\,dx_1 = \int_0^{1/2} f_1(x_1)\,dx_1 = \frac{3}{8}.$$

Suppose, though, we want to find the probability $P(X_1 + X_2 \le 1)$. Notice that
the region of integration is the interior of the triangle with vertices (0,0), (1,0) and
(0,1). The reader should sketch this region on the space of $(X_1, X_2)$. Fixing $x_1$ and
integrating with respect to $x_2$, we have

$$\begin{aligned}
P(X_1 + X_2 \le 1) &= \int_0^1\left[\int_0^{1 - x_1} (x_1 + x_2)\,dx_2\right] dx_1 \\
  &= \int_0^1\left[x_1(1 - x_1) + \frac{(1 - x_1)^2}{2}\right] dx_1 \\
  &= \int_0^1\left(\frac{1}{2} - \frac{1}{2}x_1^2\right) dx_1 = \frac{1}{3}.
\end{aligned}$$

This latter probability is the volume under the surface $f(x_1, x_2) = x_1 + x_2$ above
the set $\{(x_1, x_2) : 0 < x_1,\ 0 < x_2,\ x_1 + x_2 \le 1\}$. ■
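The two probabilities in Example 2.1.6 can also be obtained by numerical iterated integration in R. In the sketch below, `iterated` and its argument `x2max` are hypothetical names introduced for this illustration; the inner integral is wrapped in `sapply` because `integrate` requires a vectorized integrand.

```r
# Joint pdf of Example 2.1.6 on the unit square.
f <- function(x1, x2) x1 + x2

# Iterated integral: integrate over x2 in (0, x2max(x1)) first, then over
# x1 in (0, x1_upper). x2max is a function giving the upper limit for x2.
iterated <- function(x1_upper, x2max) {
  inner <- function(x1) {
    sapply(x1, function(u) integrate(function(x2) f(u, x2), 0, x2max(u))$value)
  }
  integrate(inner, 0, x1_upper)$value
}

p_half <- iterated(0.5, function(u) 1)      # P(X1 <= 1/2)
p_tri  <- iterated(1,   function(u) 1 - u)  # P(X1 + X2 <= 1)
print(c(p_half = p_half, p_tri = p_tri))
```

The only change between the two calls is the upper limit of the inner integral, which is where the sketch of the region of integration enters.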


Example 2.1.7 (Example 2.1.3, Continued). Recall that the random variables $X$
and $Y$ of Example 2.1.3 were the lifetimes of two batteries installed in an electrical
component. The joint pdf of $(X, Y)$ is sketched in Figure 2.1.1. Its space is the
positive quadrant of $R^2$ so there are no constraints involving both $x$ and $y$. Using
the change-in-variable $w = y^2$, the marginal pdf of $X$ is

$$f_X(x) = \int_0^{\infty} 4xy\,e^{-(x^2 + y^2)}\,dy = 2x e^{-x^2}\int_0^{\infty} e^{-w}\,dw = 2x e^{-x^2},$$

for $x > 0$. By the symmetry of $x$ and $y$ in the model, the pdf of $Y$ is the same as
that of $X$. To determine the median lifetime, $\theta$, of these batteries, we need to solve

$$\frac{1}{2} = \int_0^{\theta} 2x e^{-x^2}\,dx = 1 - e^{-\theta^2},$$

where again we have made use of the change-in-variables $z = x^2$. Solving this
equation, we obtain $\theta = \sqrt{\log 2} \approx 0.8326$. So 50% of the batteries have lifetimes
exceeding 0.83 units. ■
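The median can also be found numerically by solving $F_X(\theta) = 1/2$ with a root finder. The R sketch below uses `uniroot`; the search interval (0, 5) and the tolerance are assumptions that happen to bracket the root.

```r
# cdf of X for x > 0, obtained from the marginal pdf fX(x) = 2x exp(-x^2).
FX <- function(x) 1 - exp(-x^2)

# Solve FX(theta) = 1/2 on an interval that brackets the root.
med <- uniroot(function(x) FX(x) - 0.5, interval = c(0, 5), tol = 1e-10)$root
print(c(numeric = med, closed_form = sqrt(log(2))))
```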


2.1.2 Expectation 


The concept of expectation extends in a straightforward manner. Let (X1,X2) bea 
random vector and let Y = g(X1, X2) for some real-valued function; i.e., g : R? — R. 
Then Y is a random variable and we could determine its expectation by obtaining 
the distribution of Y. But Theorem 1.8.1 is true for random vectors also. Note the 
proof we gave for this theorem involved the discrete case, and Exercise 2.1.12 shows 
its extension to the random vector case. 

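Suppose $(X_1, X_2)$ is of the continuous type. Then $E(Y)$ exists if

\[
\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} |g(x_1, x_2)|\, f_{X_1,X_2}(x_1, x_2)\,dx_1\,dx_2 < \infty.
\]

Then

\[
E(Y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x_1, x_2)\, f_{X_1,X_2}(x_1, x_2)\,dx_1\,dx_2. \qquad (2.1.10)
\]

Likewise if $(X_1, X_2)$ is discrete, then $E(Y)$ exists if

\[
\sum_{x_1}\sum_{x_2} |g(x_1, x_2)|\, p_{X_1,X_2}(x_1, x_2) < \infty.
\]

Then

\[
E(Y) = \sum_{x_1}\sum_{x_2} g(x_1, x_2)\, p_{X_1,X_2}(x_1, x_2). \qquad (2.1.11)
\]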


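We can now show that $E$ is a linear operator.


Theorem 2.1.1. Let $(X_1, X_2)$ be a random vector. Let $Y_1 = g_1(X_1, X_2)$ and $Y_2 =
g_2(X_1, X_2)$ be random variables whose expectations exist. Then for all real numbers
$k_1$ and $k_2$,

\[
E(k_1Y_1 + k_2Y_2) = k_1E(Y_1) + k_2E(Y_2). \qquad (2.1.12)
\]

Proof: We prove it for the continuous case. The existence of the expected value of
$k_1Y_1 + k_2Y_2$ follows directly from the triangle inequality and linearity of integrals;
i.e.,

\[
\begin{aligned}
\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} & |k_1g_1(x_1, x_2) + k_2g_2(x_1, x_2)|\, f_{X_1,X_2}(x_1, x_2)\,dx_1\,dx_2 \\
&\le |k_1| \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} |g_1(x_1, x_2)|\, f_{X_1,X_2}(x_1, x_2)\,dx_1\,dx_2 \\
&\quad + |k_2| \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} |g_2(x_1, x_2)|\, f_{X_1,X_2}(x_1, x_2)\,dx_1\,dx_2 < \infty.
\end{aligned}
\]

By once again using linearity of the integral, we have

\[
\begin{aligned}
E(k_1Y_1 + k_2Y_2) &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} [k_1g_1(x_1, x_2) + k_2g_2(x_1, x_2)]\, f_{X_1,X_2}(x_1, x_2)\,dx_1\,dx_2 \\
&= k_1 \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g_1(x_1, x_2)\, f_{X_1,X_2}(x_1, x_2)\,dx_1\,dx_2 \\
&\quad + k_2 \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g_2(x_1, x_2)\, f_{X_1,X_2}(x_1, x_2)\,dx_1\,dx_2 \\
&= k_1E(Y_1) + k_2E(Y_2),
\end{aligned}
\]

i.e., the desired result. ■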


We also note that the expected value of any function $g(X_2)$ of $X_2$ can be found
in two ways:

\[
E(g(X_2)) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x_2) f(x_1, x_2)\,dx_1\,dx_2 = \int_{-\infty}^{\infty} g(x_2) f_{X_2}(x_2)\,dx_2,
\]

the latter single integral being obtained from the double integral by integrating on
$x_1$ first. The following example illustrates these ideas.
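Example 2.1.8. Let $X_1$ and $X_2$ have the pdf

\[
f(x_1, x_2) = \begin{cases} 8x_1x_2 & 0 < x_1 < x_2 < 1 \\ 0 & \text{elsewhere.} \end{cases}
\]

Figure 2.1.3 shows the space for $(X_1, X_2)$. Then

\[
E(X_1X_2^2) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x_1x_2^2 f(x_1, x_2)\,dx_1\,dx_2.
\]

To compute the integration, as shown by the arrow on Figure 2.1.3, we fix $x_2$ and
then integrate $x_1$ from 0 to $x_2$. We then integrate out $x_2$ from 0 to 1. Hence,

\[
\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x_1x_2^2 f(x_1, x_2)\,dx_1\,dx_2 = \int_0^1 \left[ \int_0^{x_2} 8x_1^2x_2^3\,dx_1 \right] dx_2 = \int_0^1 \frac{8}{3}x_2^6\,dx_2 = \frac{8}{21}.
\]

In addition,

\[
E(X_2) = \int_0^1 \left[ \int_0^{x_2} x_2(8x_1x_2)\,dx_1 \right] dx_2 = \frac{4}{5}.
\]

Since $X_2$ has the pdf $f_2(x_2) = 4x_2^3$, $0 < x_2 < 1$, zero elsewhere, the latter
expectation can also be found by

\[
E(X_2) = \int_0^1 x_2(4x_2^3)\,dx_2 = \frac{4}{5}.
\]

Using Theorem 2.1.1,

\[
E(7X_1X_2^2 + 5X_2) = 7E(X_1X_2^2) + 5E(X_2) = 7\left(\frac{8}{21}\right) + 5\left(\frac{4}{5}\right) = \frac{20}{3}. \quad ■
\]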
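The expectations of Example 2.1.8, and the linearity in Theorem 2.1.1, can be checked numerically. In the R sketch below, expect() is a helper of our own (not from hmcpkg) that evaluates expression (2.1.10) as an iterated integral over the support $0 < x_1 < x_2 < 1$.

```r
# Joint pdf of Example 2.1.8 on its support 0 < x1 < x2 < 1.
f <- function(x1, x2) 8 * x1 * x2

# E[g(X1, X2)] as an iterated integral: x1 from 0 to x2, then x2 from 0 to 1.
expect <- function(g) {
  inner <- function(x2) {
    sapply(x2, function(b) {
      integrate(function(x1) g(x1, b) * f(x1, b), 0, b)$value
    })
  }
  integrate(inner, 0, 1)$value
}

e1 <- expect(function(x1, x2) x1 * x2^2)               # E(X1 X2^2)
e2 <- expect(function(x1, x2) x2)                      # E(X2)
e3 <- expect(function(x1, x2) 7 * x1 * x2^2 + 5 * x2)  # E(7 X1 X2^2 + 5 X2)

c(e1, e2, e3, 7 * e1 + 5 * e2)   # the last two entries should agree
```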


Example 2.1.9. Continuing with Example 2.1.8, suppose the random variable $Y$
is defined by $Y = X_1/X_2$. We determine $E(Y)$ in two ways. The first way is by
definition; i.e., find the distribution of $Y$ and then determine its expectation. The
cdf of $Y$, for $0 < y \le 1$, is

\[
\begin{aligned}
F_Y(y) = P(Y \le y) = P(X_1 \le yX_2) &= \int_0^1 \left[ \int_0^{yx_2} 8x_1x_2\,dx_1 \right] dx_2 \\
&= \int_0^1 4y^2x_2^3\,dx_2 = y^2.
\end{aligned}
\]

Hence, the pdf of $Y$ is

\[
f_Y(y) = F_Y'(y) = \begin{cases} 2y & 0 < y < 1 \\ 0 & \text{elsewhere,} \end{cases}
\]

which leads to

\[
E(Y) = \int_0^1 y(2y)\,dy = \frac{2}{3}.
\]

Figure 2.1.3: Region of integration for Example 2.1.8. The arrow depicts the
integration with respect to $x_1$ at a fixed but arbitrary $x_2$.


For the second way, we make use of expression (2.1.10) and find $E(Y)$ directly by

\[
E(Y) = \int_0^1 \left\{ \int_0^{x_2} \left( \frac{x_1}{x_2} \right) 8x_1x_2\,dx_1 \right\} dx_2 = \int_0^1 \frac{8}{3}x_2^3\,dx_2 = \frac{2}{3}. \quad ■
\]
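The two routes to $E(Y)$ can be compared numerically. This R sketch is only an illustration; the names way1 and way2 are ours. The first value integrates $y$ against the pdf $2y$ of $Y$, and the second evaluates expression (2.1.10) with $g(x_1, x_2) = x_1/x_2$ against the joint pdf of Example 2.1.8.

```r
# Joint pdf of Example 2.1.8 on 0 < x1 < x2 < 1.
f <- function(x1, x2) 8 * x1 * x2

# Way 1: use the pdf of Y = X1 / X2, namely f_Y(y) = 2 y on (0, 1).
way1 <- integrate(function(y) y * 2 * y, 0, 1)$value

# Way 2: integrate g(x1, x2) = x1 / x2 against the joint pdf directly.
inner <- function(x2) {
  sapply(x2, function(b) {
    integrate(function(x1) (x1 / b) * f(x1, b), 0, b)$value
  })
}
way2 <- integrate(inner, 0, 1)$value

c(way1, way2)   # both should be close to 2/3
```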


We next define the moment generating function of a random vector.


Definition 2.1.2 (Moment Generating Function of a Random Vector). Let $\mathbf{X} =
(X_1, X_2)'$ be a random vector. If $E(e^{t_1X_1 + t_2X_2})$ exists for $|t_1| < h_1$ and $|t_2| <
h_2$, where $h_1$ and $h_2$ are positive, it is denoted by $M_{X_1,X_2}(t_1, t_2)$ and is called the
moment generating function (mgf) of $\mathbf{X}$.


As in the one-variable case, if it exists, the mgf of a random vector uniquely
determines the distribution of the random vector.
Let $\mathbf{t} = (t_1, t_2)'$. Then we can write the mgf of $\mathbf{X}$ as

\[
M_{X_1,X_2}(\mathbf{t}) = E\left[ e^{\mathbf{t}'\mathbf{X}} \right], \qquad (2.1.13)
\]

so it is quite similar to the mgf of a random variable. Also, the mgfs of $X_1$ and $X_2$
are immediately seen to be $M_{X_1,X_2}(t_1, 0)$ and $M_{X_1,X_2}(0, t_2)$, respectively. If there
is no confusion, we often drop the subscripts on $M$.
Example 2.1.10. Let the continuous-type random variables $X$ and $Y$ have the
joint pdf

\[
f(x, y) = \begin{cases} e^{-y} & 0 < x < y < \infty \\ 0 & \text{elsewhere.} \end{cases}
\]

The reader should sketch the space of $(X, Y)$. The mgf of this joint distribution is

\[
M(t_1, t_2) = \int_0^\infty \left[ \int_x^\infty \exp(t_1x + t_2y - y)\,dy \right] dx = \frac{1}{(1 - t_1 - t_2)(1 - t_2)},
\]

provided that $t_1 + t_2 < 1$ and $t_2 < 1$. Furthermore, the moment-generating
functions of the marginal distributions of $X$ and $Y$ are, respectively,

\[
M(t_1, 0) = \frac{1}{1 - t_1}, \quad t_1 < 1,
\]

\[
M(0, t_2) = \frac{1}{(1 - t_2)^2}, \quad t_2 < 1.
\]

These moment-generating functions are, of course, respectively, those of the
marginal probability density functions,

\[
f_1(x) = \int_x^\infty e^{-y}\,dy = e^{-x}, \quad 0 < x < \infty,
\]

zero elsewhere, and

\[
f_2(y) = e^{-y} \int_0^y dx = ye^{-y}, \quad 0 < y < \infty,
\]

zero elsewhere. ■
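As a numerical check of the closed form for $M(t_1, t_2)$, the R sketch below evaluates the double integral with integrate(). The function names mgf_numeric and mgf_closed and the evaluation point $(0.2, 0.3)$ are our own choices; any point with $t_1 + t_2 < 1$ and $t_2 < 1$ would do.

```r
# E[exp(t1 X + t2 Y)] for the joint pdf e^{-y} on 0 < x < y < infinity,
# computed as an iterated integral (y from x to Inf, then x from 0 to Inf).
mgf_numeric <- function(t1, t2) {
  inner <- function(x) {
    sapply(x, function(a) {
      integrate(function(y) exp(t1 * a + t2 * y - y), a, Inf)$value
    })
  }
  integrate(inner, 0, Inf)$value
}

# Closed form from the example, valid for t1 + t2 < 1 and t2 < 1.
mgf_closed <- function(t1, t2) 1 / ((1 - t1 - t2) * (1 - t2))

m_num <- mgf_numeric(0.2, 0.3)
m_cf  <- mgf_closed(0.2, 0.3)
c(m_num, m_cf)
```

Setting one argument to zero in mgf_closed gives the marginal mgfs quoted in the example.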


We also need to define the expected value of the random vector itself, but this
is not a new concept because it is defined in terms of componentwise expectation:


Definition 2.1.3 (Expected Value of a Random Vector). Let $\mathbf{X} = (X_1, X_2)'$ be a
random vector. Then the expected value of $\mathbf{X}$ exists if the expectations of $X_1$ and
$X_2$ exist. If it exists, then the expected value is given by

\[
E[\mathbf{X}] = \begin{bmatrix} E(X_1) \\ E(X_2) \end{bmatrix}. \qquad (2.1.14)
\]


EXERCISES 


2.1.1. Let $f(x_1, x_2) = 4x_1x_2$, $0 < x_1 < 1$, $0 < x_2 < 1$, zero elsewhere, be the pdf
of $X_1$ and $X_2$. Find $P(0 < X_1 < \frac{1}{2}, \frac{1}{4} < X_2 < 1)$, $P(X_1 = X_2)$, $P(X_1 < X_2)$, and
$P(X_1 \le X_2)$.

Hint: Recall that $P(X_1 = X_2)$ would be the volume under the surface $f(x_1, x_2) =
4x_1x_2$ and above the line segment $0 < x_1 = x_2 < 1$ in the $x_1x_2$-plane.
2.1.2. Let $A_1 = \{(x, y) : x \le 2, y \le 4\}$, $A_2 = \{(x, y) : x \le 2, y \le 1\}$, $A_3 =
\{(x, y) : x \le 0, y \le 4\}$, and $A_4 = \{(x, y) : x \le 0, y \le 1\}$ be subsets of the
space $\mathcal{A}$ of two random variables $X$ and $Y$, which is the entire two-dimensional
plane. If $P(A_1) = \frac{7}{8}$, $P(A_2) = \frac{4}{8}$, $P(A_3) = \frac{3}{8}$, and $P(A_4) = \frac{2}{8}$, find $P(A_5)$, where
$A_5 = \{(x, y) : 0 < x \le 2, 1 < y \le 4\}$.


2.1.3. Let $F(x, y)$ be the distribution function of $X$ and $Y$. For all real constants
$a < b$, $c < d$, show that $P(a < X \le b, c < Y \le d) = F(b, d) - F(b, c) - F(a, d) +
F(a, c)$.

2.1.4. Show that the function $F(x, y)$ that is equal to 1 provided that $x + 2y \ge 1$,
and that is equal to zero provided that $x + 2y < 1$, cannot be a distribution function
of two random variables.
Hint: Find four numbers $a < b$, $c < d$, so that

\[
F(b, d) - F(a, d) - F(b, c) + F(a, c)
\]

is less than zero.


2.1.5. Given that the nonnegative function $g(x)$ has the property that

\[
\int_0^\infty g(x)\,dx = 1,
\]

show that

\[
f(x_1, x_2) = \frac{2g\left(\sqrt{x_1^2 + x_2^2}\right)}{\pi\sqrt{x_1^2 + x_2^2}}, \quad 0 < x_1 < \infty,\ 0 < x_2 < \infty,
\]

zero elsewhere, satisfies the conditions for a pdf of two continuous-type random
variables $X_1$ and $X_2$.
Hint: Use polar coordinates.


2.1.6. Consider Example 2.1.3.
(a) Show that $P(a < X < b, c < Y < d) = (\exp\{-a^2\} - \exp\{-b^2\})(\exp\{-c^2\} -
\exp\{-d^2\})$.
(b) Using Part (a) and the notation in Example 2.1.3, show that $P[(X, Y) \in A] =
0.1879$ while $P[(X, Y) \in B] = 0.0026$.


(c) Show that the following R program computes $P(a < X < b, c < Y < d)$.
Then use it to compute the probabilities in Part (b).


plifetime <- function(a,b,c,d)
{(exp(-a^2) - exp(-b^2))*(exp(-c^2) - exp(-d^2))}


2.1.7. Let $f(x, y) = e^{-x-y}$, $0 < x < \infty$, $0 < y < \infty$, zero elsewhere, be the pdf of
$X$ and $Y$. Then if $Z = X + Y$, compute $P(Z \le 0)$, $P(Z \le 6)$, and, more generally,
$P(Z \le z)$, for $0 < z < \infty$. What is the pdf of $Z$?

2.1.8. Let $X$ and $Y$ have the pdf $f(x, y) = 1$, $0 < x < 1$, $0 < y < 1$, zero elsewhere.
Find the cdf and pdf of the product $Z = XY$.


2.1.9. Let 13 cards be taken, at random and without replacement, from an ordinary
deck of playing cards. If $X$ is the number of spades in these 13 cards, find the pmf of
$X$. If, in addition, $Y$ is the number of hearts in these 13 cards, find the probability
$P(X = 2, Y = 5)$. What is the joint pmf of $X$ and $Y$?


2.1.10. Let the random variables $X_1$ and $X_2$ have the joint pmf described as follows:

(x1, x2)     (0,0)   (0,1)   (0,2)   (1,0)   (1,1)   (1,2)
p(x1, x2)    2/12    3/12    2/12    2/12    2/12    1/12

and $p(x_1, x_2)$ is equal to zero elsewhere.
(a) Write these probabilities in a rectangular array as in Example 2.1.4, recording
each marginal pdf in the “margins.”
(b) What is $P(X_1 + X_2 = 1)$?


2.1.11. Let $X_1$ and $X_2$ have the joint pdf $f(x_1, x_2) = 15x_1^2x_2$, $0 < x_1 < x_2 < 1$,
zero elsewhere. Find the marginal pdfs and compute $P(X_1 + X_2 \le 1)$.

Hint: Graph the space of $X_1$ and $X_2$ and carefully choose the limits of integration
in determining each marginal pdf.


2.1.12. Let $X_1, X_2$ be two random variables with the joint pmf $p(x_1, x_2)$, $(x_1, x_2) \in
\mathcal{S}$, where $\mathcal{S}$ is the support of $X_1, X_2$. Let $Y = g(X_1, X_2)$ be a function such that

\[
\sum\sum_{(x_1, x_2) \in \mathcal{S}} |g(x_1, x_2)|\, p(x_1, x_2) < \infty.
\]

By following the proof of Theorem 1.8.1, show that

\[
E(Y) = \sum\sum_{(x_1, x_2) \in \mathcal{S}} g(x_1, x_2)\, p(x_1, x_2).
\]


2.1.13. Let $X_1, X_2$ be two random variables with the joint pmf $p(x_1, x_2) = (x_1 +
x_2)/12$, for $x_1 = 1, 2$, $x_2 = 1, 2$, zero elsewhere. Compute $E(X_1)$, $E(X_1^2)$, $E(X_2)$,
$E(X_2^2)$, and $E(X_1X_2)$. Is $E(X_1X_2) = E(X_1)E(X_2)$? Find $E(2X_1 - 6X_2^2 + 7X_1X_2)$.


2.1.14. Let $X_1, X_2$ be two random variables with joint pdf $f(x_1, x_2) = 4x_1x_2$,
$0 < x_1 < 1$, $0 < x_2 < 1$, zero elsewhere. Compute $E(X_1)$, $E(X_1^2)$, $E(X_2)$, $E(X_2^2)$,
and $E(X_1X_2)$. Is $E(X_1X_2) = E(X_1)E(X_2)$? Find $E(3X_2 - 2X_1^2 + 6X_1X_2)$.

2.1.15. Let $X_1, X_2$ be two random variables with joint pmf $p(x_1, x_2) = (1/2)^{x_1 + x_2}$,
for $1 \le x_i < \infty$, $i = 1, 2$, where $x_1$ and $x_2$ are integers, zero elsewhere. Determine
the joint mgf of $X_1, X_2$. Show that $M(t_1, t_2) = M(t_1, 0)M(0, t_2)$.

2.1.16. Let $X_1, X_2$ be two random variables with joint pdf $f(x_1, x_2) = x_1\exp\{-x_2\}$,
for $0 < x_1 < x_2 < \infty$, zero elsewhere. Determine the joint mgf of $X_1, X_2$. Does
$M(t_1, t_2) = M(t_1, 0)M(0, t_2)$?

2.1.17. Let $X$ and $Y$ have the joint pdf $f(x, y) = 6(1 - x - y)$, $x + y < 1$, $0 < x$,
$0 < y$, zero elsewhere. Compute $P(2X + 3Y < 1)$ and $E(XY + 2X^2)$.
2.2 Transformations: Bivariate Random Variables


Let $(X_1, X_2)$ be a random vector. Suppose we know the joint distribution of
$(X_1, X_2)$ and we seek the distribution of a transformation of $(X_1, X_2)$, say, $Y =
g(X_1, X_2)$. We may be able to obtain the cdf of $Y$. Another way is to use a
transformation as we did for univariate random variables in Sections 1.6 and 1.7. In this
section, we extend this theory to random vectors. It is best to discuss the discrete
and continuous cases separately. We begin with the discrete case.

There are no essential difficulties involved in a problem like the following. Let
$p_{X_1,X_2}(x_1, x_2)$ be the joint pmf of two discrete-type random variables $X_1$ and $X_2$
with $\mathcal{S}$ the (two-dimensional) set of points at which $p_{X_1,X_2}(x_1, x_2) > 0$; i.e., $\mathcal{S}$ is the
support of $(X_1, X_2)$. Let $y_1 = u_1(x_1, x_2)$ and $y_2 = u_2(x_1, x_2)$ define a one-to-one
transformation that maps $\mathcal{S}$ onto $\mathcal{T}$. The joint pmf of the two new random variables
$Y_1 = u_1(X_1, X_2)$ and $Y_2 = u_2(X_1, X_2)$ is given by

\[
p_{Y_1,Y_2}(y_1, y_2) = \begin{cases} p_{X_1,X_2}[w_1(y_1, y_2), w_2(y_1, y_2)] & (y_1, y_2) \in \mathcal{T} \\ 0 & \text{elsewhere,} \end{cases}
\]

where $x_1 = w_1(y_1, y_2)$, $x_2 = w_2(y_1, y_2)$ is the single-valued inverse of $y_1 = u_1(x_1, x_2)$,
$y_2 = u_2(x_1, x_2)$. From this joint pmf $p_{Y_1,Y_2}(y_1, y_2)$ we may obtain the marginal pmf
of $Y_1$ by summing on $y_2$ or the marginal pmf of $Y_2$ by summing on $y_1$.

In using this change of variable technique, it should be emphasized that we need 
two “new” variables to replace the two “old” variables. An example helps explain 
this technique. 


Example 2.2.1. In a large metropolitan area during flu season, suppose that two
strains of flu, A and B, are occurring. For a given week, let $X_1$ and $X_2$ be the
respective number of reported cases of strains A and B with the joint pmf

\[
p_{X_1,X_2}(x_1, x_2) = \frac{\mu_1^{x_1}\mu_2^{x_2}e^{-\mu_1}e^{-\mu_2}}{x_1!\,x_2!}, \quad x_1 = 0, 1, 2, 3, \ldots,\ x_2 = 0, 1, 2, 3, \ldots,
\]

and is zero elsewhere, where the parameters $\mu_1$ and $\mu_2$ are positive real numbers.
Thus the space $\mathcal{S}$ is the set of points $(x_1, x_2)$, where each of $x_1$ and $x_2$ is a
nonnegative integer. Further, repeatedly using the Maclaurin series for the exponential
function,³ we have

\[
\begin{aligned}
E(X_1) &= e^{-\mu_1} \sum_{x_1=0}^{\infty} x_1 \frac{\mu_1^{x_1}}{x_1!}\; e^{-\mu_2} \sum_{x_2=0}^{\infty} \frac{\mu_2^{x_2}}{x_2!} \\
&= e^{-\mu_1} \sum_{x_1=1}^{\infty} \mu_1 \frac{\mu_1^{x_1 - 1}}{(x_1 - 1)!} \cdot 1 = \mu_1.
\end{aligned}
\]

Thus $\mu_1$ is the mean number of cases of Strain A flu reported during a week.
Likewise, $\mu_2$ is the mean number of cases of Strain B flu reported during a week.


³See for example the discussion on Taylor series in Mathematical Comments as referenced in
the Preface.
A random variable of interest is $Y_1 = X_1 + X_2$; i.e., the total number of reported
cases of A and B flu during a week. By Theorem 2.1.1, we know $E(Y_1) = \mu_1 + \mu_2$;
however, we wish to determine the distribution of $Y_1$. If we use the change of
variable technique, we need to define a second random variable $Y_2$. Because $Y_2$ is
of no interest to us, let us choose it in such a way that we have a simple one-to-one
transformation. For this example, we take $Y_2 = X_2$. Then $y_1 = x_1 + x_2$ and $y_2 = x_2$
represent a one-to-one transformation that maps $\mathcal{S}$ onto

\[
\mathcal{T} = \{(y_1, y_2) : y_2 = 0, 1, \ldots, y_1 \text{ and } y_1 = 0, 1, 2, \ldots\}.
\]

Note that if $(y_1, y_2) \in \mathcal{T}$, then $0 \le y_2 \le y_1$. The inverse functions are given by
$x_1 = y_1 - y_2$ and $x_2 = y_2$. Thus the joint pmf of $Y_1$ and $Y_2$ is

\[
p_{Y_1,Y_2}(y_1, y_2) = \frac{\mu_1^{y_1 - y_2}\mu_2^{y_2}e^{-\mu_1 - \mu_2}}{(y_1 - y_2)!\,y_2!}, \quad (y_1, y_2) \in \mathcal{T},
\]

and is zero elsewhere. Consequently, the marginal pmf of $Y_1$ is given by

\[
\begin{aligned}
p_{Y_1}(y_1) &= \sum_{y_2=0}^{y_1} p_{Y_1,Y_2}(y_1, y_2) \\
&= \frac{e^{-\mu_1 - \mu_2}}{y_1!} \sum_{y_2=0}^{y_1} \frac{y_1!}{(y_1 - y_2)!\,y_2!}\, \mu_1^{y_1 - y_2}\mu_2^{y_2} \\
&= \frac{(\mu_1 + \mu_2)^{y_1}e^{-\mu_1 - \mu_2}}{y_1!}, \quad y_1 = 0, 1, 2, \ldots,
\end{aligned}
\]

and is zero elsewhere, where the third equality follows from the binomial expansion. ■
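The summation over $y_2$ can be carried out numerically and compared with the Poisson pmf with mean $\mu_1 + \mu_2$. In the R sketch below, the values of mu1 and mu2 are illustrative assumptions (they are not given in the text), and dpois() is base R's Poisson pmf.

```r
mu1 <- 2.5   # assumed mean for strain A (illustrative only)
mu2 <- 1.2   # assumed mean for strain B (illustrative only)
y1 <- 0:15

# Marginal pmf of Y1: sum the joint pmf of (Y1, Y2) over y2 = 0, ..., y1,
# using x1 = y1 - y2 and x2 = y2 in the joint pmf of (X1, X2).
p_sum <- sapply(y1, function(y) {
  y2 <- 0:y
  sum(dpois(y - y2, mu1) * dpois(y2, mu2))
})

# Largest discrepancy from the Poisson pmf with mean mu1 + mu2.
max_diff <- max(abs(p_sum - dpois(y1, mu1 + mu2)))
max_diff
```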


For the continuous case we begin with an example that illustrates the cdf tech- 
nique. 


Example 2.2.2. Consider an experiment in which a person chooses at random
a point $(X_1, X_2)$ from the unit square $\mathcal{S} = \{(x_1, x_2) : 0 < x_1 < 1, 0 < x_2 < 1\}$.
Suppose that our interest is not in $X_1$ or in $X_2$ but in $Z = X_1 + X_2$. Once a suitable
probability model has been adopted, we shall see how to find the pdf of $Z$. To be
specific, let the nature of the random experiment be such that it is reasonable to
assume that the distribution of probability over the unit square is uniform. Then
the pdf of $X_1$ and $X_2$ may be written

\[
f_{X_1,X_2}(x_1, x_2) = \begin{cases} 1 & 0 < x_1 < 1,\ 0 < x_2 < 1 \\ 0 & \text{elsewhere,} \end{cases} \qquad (2.2.1)
\]

and this describes the probability model. Now let the cdf of $Z$ be denoted by
$F_Z(z) = P(X_1 + X_2 \le z)$. Then

\[
F_Z(z) = \begin{cases}
0 & z < 0 \\
\int_0^z \int_0^{z - x_1} dx_2\,dx_1 = \frac{z^2}{2} & 0 \le z < 1 \\
1 - \int_{z-1}^1 \int_{z - x_1}^1 dx_2\,dx_1 = 1 - \frac{(2 - z)^2}{2} & 1 \le z < 2 \\
1 & 2 \le z.
\end{cases}
\]

Since $F_Z'(z)$ exists for all values of $z$, the pdf of $Z$ may then be written

\[
f_Z(z) = \begin{cases} z & 0 < z < 1 \\ 2 - z & 1 \le z < 2 \\ 0 & \text{elsewhere.} \end{cases} \qquad (2.2.2)
\]
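A small simulation gives an empirical check of the cdf of $Z$. The R sketch below is an illustration only; the seed, the sample size, and the grid of evaluation points are arbitrary choices of ours.

```r
set.seed(1)                   # arbitrary seed, for reproducibility
n <- 100000                   # arbitrary simulation size
z <- runif(n) + runif(n)      # realizations of Z = X1 + X2

# cdf of Z derived in the example.
Fz <- function(z) {
  ifelse(z < 0, 0,
    ifelse(z < 1, z^2 / 2,
      ifelse(z < 2, 1 - (2 - z)^2 / 2, 1)))
}

grid <- c(0.25, 0.5, 1, 1.5, 1.75)
emp <- sapply(grid, function(q) mean(z <= q))   # empirical cdf on the grid

round(cbind(grid, empirical = emp, exact = Fz(grid)), 4)
```

The empirical and exact columns should agree to about two decimal places at this sample size.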


In the last example, we used the cdf technique to find the distribution of the
transformed random vector. Recall in Chapter 1, Theorem 1.7.1 gave a
transformation technique to directly determine the pdf of the transformed random variable
for one-to-one transformations. As discussed in Section 4.1 of the accompanying
resource Mathematical Comments,⁴ this is based on the change-in-variable technique
for univariate integration. Further, Section 4.2 of this resource shows that a
similar change-in-variable technique exists for multiple integration. We now discuss in
general the transformation technique for the continuous case based on this theory.

Let $(X_1, X_2)$ have a jointly continuous distribution with pdf $f_{X_1,X_2}(x_1, x_2)$ and
support set $\mathcal{S}$. Consider the transformed random vector $(Y_1, Y_2) = T(X_1, X_2)$ where
$T$ is a one-to-one continuous transformation. Let $\mathcal{T} = T(\mathcal{S})$ denote the support of
$(Y_1, Y_2)$. The transformation is depicted in Figure 2.2.1. Rewrite the
transformation in terms of its components as $(Y_1, Y_2) = T(X_1, X_2) = (u_1(X_1, X_2), u_2(X_1, X_2))$,
where the functions $y_1 = u_1(x_1, x_2)$ and $y_2 = u_2(x_1, x_2)$ define $T$. Since the
transformation is one-to-one, the inverse transformation $T^{-1}$ exists. We write it as
$x_1 = w_1(y_1, y_2)$, $x_2 = w_2(y_1, y_2)$. Finally, we need the Jacobian of the
transformation, which is the determinant of order 2 given by

\[
J = \begin{vmatrix} \dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_1}{\partial y_2} \\[2ex] \dfrac{\partial x_2}{\partial y_1} & \dfrac{\partial x_2}{\partial y_2} \end{vmatrix}.
\]

Note that $J$ plays the role of $dx/dy$ in the univariate case. We assume that these
first-order partial derivatives are continuous and that the Jacobian $J$ is not
identically equal to zero in $\mathcal{T}$.

Let $B$ be any region⁵ in $\mathcal{T}$ and let $A = T^{-1}(B)$ as shown in Figure 2.2.1.
Because the transformation $T$ is one-to-one, $P[(X_1, X_2) \in A] = P[T(X_1, X_2) \in
T(A)] = P[(Y_1, Y_2) \in B]$. Then based on the change-in-variable technique, cited
above, we have

\[
\begin{aligned}
P[(X_1, X_2) \in A] &= \int\int_A f_{X_1,X_2}(x_1, x_2)\,dx_1\,dx_2 \\
&= \int\int_{T(A)} f_{X_1,X_2}[T^{-1}(y_1, y_2)]\,|J|\,dy_1\,dy_2 \\
&= \int\int_B f_{X_1,X_2}[w_1(y_1, y_2), w_2(y_1, y_2)]\,|J|\,dy_1\,dy_2.
\end{aligned}
\]


⁴See the reference for Mathematical Comments in the Preface.
⁵Technically an event in the support of $(Y_1, Y_2)$.
Since $B$ is arbitrary, the last integrand must be the joint pdf of $(Y_1, Y_2)$. That is,
the pdf of $(Y_1, Y_2)$ is

\[
f_{Y_1,Y_2}(y_1, y_2) = \begin{cases} f_{X_1,X_2}[w_1(y_1, y_2), w_2(y_1, y_2)]\,|J| & (y_1, y_2) \in \mathcal{T} \\ 0 & \text{elsewhere.} \end{cases} \qquad (2.2.3)
\]

Several examples of this result are given next.


Figure 2.2.1: A general sketch of the supports of $(X_1, X_2)$, ($\mathcal{S}$), and $(Y_1, Y_2)$, ($\mathcal{T}$).


Example 2.2.3. Reconsider Example 2.2.2, where $(X_1, X_2)$ have the uniform
distribution over the unit square with the pdf given in expression (2.2.1). The support
of $(X_1, X_2)$ is the set $\mathcal{S} = \{(x_1, x_2) : 0 < x_1 < 1, 0 < x_2 < 1\}$ as depicted in Figure
2.2.2.

Suppose $Y_1 = X_1 + X_2$ and $Y_2 = X_1 - X_2$. The transformation is given by

\[
\begin{aligned}
y_1 &= u_1(x_1, x_2) = x_1 + x_2 \\
y_2 &= u_2(x_1, x_2) = x_1 - x_2.
\end{aligned}
\]

This transformation is one-to-one. We first determine the set $\mathcal{T}$ in the $y_1y_2$-plane
that is the mapping of $\mathcal{S}$ under this transformation. Now

\[
\begin{aligned}
x_1 &= w_1(y_1, y_2) = \tfrac{1}{2}(y_1 + y_2) \\
x_2 &= w_2(y_1, y_2) = \tfrac{1}{2}(y_1 - y_2).
\end{aligned}
\]

To determine the set $\mathcal{T}$ in the $y_1y_2$-plane onto which $\mathcal{S}$ is mapped under the
transformation, note that the boundaries of $\mathcal{S}$ are transformed as follows into the boundaries
of $\mathcal{T}$:

\[
\begin{aligned}
x_1 = 0 \quad &\text{into} \quad 0 = \tfrac{1}{2}(y_1 + y_2) \\
x_1 = 1 \quad &\text{into} \quad 1 = \tfrac{1}{2}(y_1 + y_2) \\
x_2 = 0 \quad &\text{into} \quad 0 = \tfrac{1}{2}(y_1 - y_2) \\
x_2 = 1 \quad &\text{into} \quad 1 = \tfrac{1}{2}(y_1 - y_2).
\end{aligned}
\]

Figure 2.2.2: The support of $(X_1, X_2)$ of Example 2.2.3.


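Accordingly, $\mathcal{T}$ is shown in Figure 2.2.3. Next, the Jacobian is given by

\[
J = \begin{vmatrix} \dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_1}{\partial y_2} \\[2ex] \dfrac{\partial x_2}{\partial y_1} & \dfrac{\partial x_2}{\partial y_2} \end{vmatrix}
= \begin{vmatrix} \frac{1}{2} & \frac{1}{2} \\[1ex] \frac{1}{2} & -\frac{1}{2} \end{vmatrix} = -\frac{1}{2}.
\]

Although we suggest transforming the boundaries of $\mathcal{S}$, others might want to
use the inequalities

\[
0 < x_1 < 1 \quad \text{and} \quad 0 < x_2 < 1
\]

directly. These four inequalities become

\[
0 < \tfrac{1}{2}(y_1 + y_2) < 1 \quad \text{and} \quad 0 < \tfrac{1}{2}(y_1 - y_2) < 1.
\]

It is easy to see that these are equivalent to

\[
-y_1 < y_2, \quad y_2 < 2 - y_1, \quad y_2 < y_1, \quad y_1 - 2 < y_2;
\]

and they define the set $\mathcal{T}$.
Hence, the joint pdf of $(Y_1, Y_2)$ is given by

\[
f_{Y_1,Y_2}(y_1, y_2) = \begin{cases} f_{X_1,X_2}\left[\tfrac{1}{2}(y_1 + y_2), \tfrac{1}{2}(y_1 - y_2)\right]|J| = \tfrac{1}{2} & (y_1, y_2) \in \mathcal{T} \\ 0 & \text{elsewhere.} \end{cases}
\]

Figure 2.2.3: The support of $(Y_1, Y_2)$ of Example 2.2.3.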


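The marginal pdf of $Y_1$ is given by

\[
f_{Y_1}(y_1) = \int_{-\infty}^{\infty} f_{Y_1,Y_2}(y_1, y_2)\,dy_2.
\]

If we refer to Figure 2.2.3, we can see that

\[
f_{Y_1}(y_1) = \begin{cases}
\int_{-y_1}^{y_1} \frac{1}{2}\,dy_2 = y_1 & 0 < y_1 \le 1 \\
\int_{y_1 - 2}^{2 - y_1} \frac{1}{2}\,dy_2 = 2 - y_1 & 1 < y_1 < 2 \\
0 & \text{elsewhere,}
\end{cases}
\]

which agrees with expression (2.2.2) of Example 2.2.2. In a similar manner, the
marginal pdf $f_{Y_2}(y_2)$ is given by

\[
f_{Y_2}(y_2) = \begin{cases}
\int_{-y_2}^{y_2 + 2} \frac{1}{2}\,dy_1 = y_2 + 1 & -1 < y_2 \le 0 \\
\int_{y_2}^{2 - y_2} \frac{1}{2}\,dy_1 = 1 - y_2 & 0 < y_2 < 1 \\
0 & \text{elsewhere.}
\end{cases}
\]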
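Expression (2.2.3) can be coded almost verbatim. The R sketch below builds the joint pdf of $(Y_1, Y_2)$ from the inverse functions and $|J|$ of this example and integrates out $y_2$; the function names are ours, and the evaluation points 0.3, 1, and 1.6 are arbitrary.

```r
# Inverse transformation and |J| from the example.
w1 <- function(y1, y2) (y1 + y2) / 2
w2 <- function(y1, y2) (y1 - y2) / 2
absJ <- 1 / 2

# Uniform pdf on the unit square, as an indicator function.
fX <- function(x1, x2) as.numeric(x1 > 0 & x1 < 1 & x2 > 0 & x2 < 1)

# Joint pdf of (Y1, Y2) by expression (2.2.3).
fY <- function(y1, y2) fX(w1(y1, y2), w2(y1, y2)) * absJ

# Marginal pdf of Y1: integrate y2 over the slice of the support at y1,
# that is, max(-y1, y1 - 2) < y2 < min(y1, 2 - y1).
fY1 <- function(y1) {
  lo <- max(-y1, y1 - 2)
  hi <- min(y1, 2 - y1)
  if (lo >= hi) return(0)
  integrate(function(y2) fY(y1, y2), lo, hi)$value
}

vals <- sapply(c(0.3, 1, 1.6), fY1)
vals   # compare with y1 on (0, 1] and 2 - y1 on (1, 2)
```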


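Example 2.2.4. Let $Y_1 = \frac{1}{2}(X_1 - X_2)$, where $X_1$ and $X_2$ have the joint pdf

\[
f_{X_1,X_2}(x_1, x_2) = \begin{cases} \frac{1}{4}\exp\left(-\frac{x_1 + x_2}{2}\right) & 0 < x_1 < \infty,\ 0 < x_2 < \infty \\ 0 & \text{elsewhere.} \end{cases}
\]

Let $Y_2 = X_2$ so that $y_1 = \frac{1}{2}(x_1 - x_2)$, $y_2 = x_2$ or, equivalently, $x_1 = 2y_1 + y_2$, $x_2 =
y_2$, define a one-to-one transformation from $\mathcal{S} = \{(x_1, x_2) : 0 < x_1 < \infty, 0 < x_2 <
\infty\}$ onto $\mathcal{T} = \{(y_1, y_2) : -2y_1 < y_2 \text{ and } 0 < y_2 < \infty, -\infty < y_1 < \infty\}$. The
Jacobian of the transformation is

\[
J = \begin{vmatrix} 2 & 1 \\ 0 & 1 \end{vmatrix} = 2;
\]

hence the joint pdf of $Y_1$ and $Y_2$ is

\[
f_{Y_1,Y_2}(y_1, y_2) = \begin{cases} \frac{|2|}{4}e^{-y_1 - y_2} & (y_1, y_2) \in \mathcal{T} \\ 0 & \text{elsewhere.} \end{cases}
\]

Thus the pdf of $Y_1$ is given by

\[
f_{Y_1}(y_1) = \begin{cases}
\int_{-2y_1}^{\infty} \frac{1}{2}e^{-y_1 - y_2}\,dy_2 = \frac{1}{2}e^{y_1} & -\infty < y_1 < 0 \\
\int_0^{\infty} \frac{1}{2}e^{-y_1 - y_2}\,dy_2 = \frac{1}{2}e^{-y_1} & 0 \le y_1 < \infty,
\end{cases}
\]

or

\[
f_{Y_1}(y_1) = \tfrac{1}{2}e^{-|y_1|}, \quad -\infty < y_1 < \infty. \qquad (2.2.4)
\]

Recall from expression (1.9.20) of Chapter 1 that $Y_1$ has the Laplace distribution.
This pdf is also frequently called the double exponential pdf. ■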
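The marginalization over $y_2$ can be checked numerically against the Laplace pdf. In this R sketch the function name fY1 and the evaluation points are ours; the lower limit max(0, -2 * y1) encodes the support constraints $0 < y_2$ and $-2y_1 < y_2$.

```r
# Marginal pdf of Y1: integrate the joint pdf (1/2) exp(-y1 - y2) over the
# part of the support with y2 > 0 and y2 > -2 * y1.
fY1 <- function(y1) {
  lower <- max(0, -2 * y1)
  integrate(function(y2) 0.5 * exp(-y1 - y2), lower, Inf)$value
}

y <- c(-2, -0.5, 0, 0.7, 3)         # arbitrary evaluation points
num <- sapply(y, fY1)
lap <- 0.5 * exp(-abs(y))           # Laplace pdf, expression (2.2.4)

cbind(y, numeric = num, laplace = lap)
```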


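Example 2.2.5. Let $X_1$ and $X_2$ have the joint pdf

\[
f_{X_1,X_2}(x_1, x_2) = \begin{cases} 10x_1x_2^2 & 0 < x_1 < x_2 < 1 \\ 0 & \text{elsewhere.} \end{cases}
\]

Suppose $Y_1 = X_1/X_2$ and $Y_2 = X_2$. Hence, the inverse transformation is $x_1 = y_1y_2$
and $x_2 = y_2$, which has the Jacobian

\[
J = \begin{vmatrix} y_2 & y_1 \\ 0 & 1 \end{vmatrix} = y_2.
\]

The inequalities defining the support $\mathcal{S}$ of $(X_1, X_2)$ become

\[
0 < y_1y_2, \quad y_1y_2 < y_2, \quad \text{and} \quad y_2 < 1.
\]

These inequalities are equivalent to

\[
0 < y_1 < 1 \quad \text{and} \quad 0 < y_2 < 1,
\]

which defines the support set $\mathcal{T}$ of $(Y_1, Y_2)$. Hence, the joint pdf of $(Y_1, Y_2)$ is

\[
f_{Y_1,Y_2}(y_1, y_2) = 10y_1y_2y_2^2|y_2| = 10y_1y_2^4, \quad (y_1, y_2) \in \mathcal{T}.
\]

The marginal pdfs are

\[
f_{Y_1}(y_1) = \int_0^1 10y_1y_2^4\,dy_2 = 2y_1, \quad 0 < y_1 < 1,
\]

zero elsewhere, and

\[
f_{Y_2}(y_2) = \int_0^1 10y_1y_2^4\,dy_1 = 5y_2^4, \quad 0 < y_2 < 1,
\]

zero elsewhere. ■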
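As a cross-check of the transformation result, the cdf of $Y_1$ implied by the marginal pdf $2y_1$, namely $y^2$, can be compared with $P(X_1 \le yX_2)$ computed directly from the joint pdf of $(X_1, X_2)$. The R sketch below does this; the names fX and cdfY1 and the evaluation points are ours.

```r
# Joint pdf of (X1, X2) on 0 < x1 < x2 < 1.
fX <- function(x1, x2) 10 * x1 * x2^2

# P(Y1 <= y) = P(X1 <= y * X2) for 0 < y < 1: integrate x1 over (0, y * x2),
# then x2 over (0, 1).
cdfY1 <- function(y) {
  inner <- function(x2) {
    sapply(x2, function(b) {
      integrate(function(x1) fX(x1, b), 0, y * b)$value
    })
  }
  integrate(inner, 0, 1)$value
}

y <- c(0.2, 0.5, 0.9)               # arbitrary evaluation points
from_X <- sapply(y, cdfY1)

cbind(y, from_X, from_pdf = y^2)    # the last two columns should agree
```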
In addition to the change-of-variable and cdf techniques for finding distributions
of functions of random variables, there is another method, called the moment
generating function (mgf) technique, which works well for linear functions of
random variables. In Subsection 2.1.2, we pointed out that if $Y = g(X_1, X_2)$, then
$E(Y)$, if it exists, could be found by

\[
E(Y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x_1, x_2) f_{X_1,X_2}(x_1, x_2)\,dx_1\,dx_2
\]

in the continuous case, with summations replacing integrals in the discrete case.
Certainly, that function $g(X_1, X_2)$ could be $\exp\{tu(X_1, X_2)\}$, so that in reality
we would be finding the mgf of the function $Z = u(X_1, X_2)$. If we could then
recognize this mgf as belonging to a certain distribution, then $Z$ would have that
distribution. We give two illustrations that demonstrate the power of this technique
by reconsidering Examples 2.2.1 and 2.2.4.


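Example 2.2.6 (Continuation of Example 2.2.1). Here $X_1$ and $X_2$ have the joint
pmf

\[
p_{X_1,X_2}(x_1, x_2) = \begin{cases} \dfrac{\mu_1^{x_1}\mu_2^{x_2}e^{-\mu_1}e^{-\mu_2}}{x_1!\,x_2!} & x_1 = 0, 1, 2, 3, \ldots,\ x_2 = 0, 1, 2, 3, \ldots \\ 0 & \text{elsewhere,} \end{cases}
\]

where $\mu_1$ and $\mu_2$ are fixed positive real numbers. Let $Y = X_1 + X_2$ and consider

\[
\begin{aligned}
E(e^{tY}) &= \sum_{x_1=0}^{\infty}\sum_{x_2=0}^{\infty} e^{t(x_1 + x_2)} p_{X_1,X_2}(x_1, x_2) \\
&= \left[ \sum_{x_1=0}^{\infty} e^{tx_1}\frac{\mu_1^{x_1}e^{-\mu_1}}{x_1!} \right] \left[ \sum_{x_2=0}^{\infty} e^{tx_2}\frac{\mu_2^{x_2}e^{-\mu_2}}{x_2!} \right] \\
&= \left[ e^{-\mu_1}\sum_{x_1=0}^{\infty} \frac{(e^t\mu_1)^{x_1}}{x_1!} \right] \left[ e^{-\mu_2}\sum_{x_2=0}^{\infty} \frac{(e^t\mu_2)^{x_2}}{x_2!} \right] \\
&= \left[ e^{\mu_1(e^t - 1)} \right] \left[ e^{\mu_2(e^t - 1)} \right] \\
&= e^{(\mu_1 + \mu_2)(e^t - 1)}.
\end{aligned}
\]

Notice that the factors in the brackets in the next-to-last equality are the mgfs of
$X_1$ and $X_2$, respectively. Hence, the mgf of $Y$ is the same as that of $X_1$ except $\mu_1$
has been replaced by $\mu_1 + \mu_2$. Therefore, by the uniqueness of mgfs, the pmf of $Y$
must be

\[
p_Y(y) = e^{-(\mu_1 + \mu_2)}\frac{(\mu_1 + \mu_2)^y}{y!}, \quad y = 0, 1, 2, \ldots,
\]

which is the same pmf that was obtained in Example 2.2.1. ■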


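Example 2.2.7 (Continuation of Example 2.2.4). Here $X_1$ and $X_2$ have the joint
pdf

\[
f_{X_1,X_2}(x_1, x_2) = \begin{cases} \frac{1}{4}\exp\left(-\frac{x_1 + x_2}{2}\right) & 0 < x_1 < \infty,\ 0 < x_2 < \infty \\ 0 & \text{elsewhere.} \end{cases}
\]

So the mgf of $Y = (1/2)(X_1 - X_2)$ is given by

\[
\begin{aligned}
E(e^{tY}) &= \int_0^\infty\int_0^\infty e^{t(x_1 - x_2)/2}\,\frac{1}{4}e^{-(x_1 + x_2)/2}\,dx_1\,dx_2 \\
&= \left[ \int_0^\infty \frac{1}{2}e^{-x_1(1 - t)/2}\,dx_1 \right] \left[ \int_0^\infty \frac{1}{2}e^{-x_2(1 + t)/2}\,dx_2 \right] \\
&= \left[ \frac{1}{1 - t} \right] \left[ \frac{1}{1 + t} \right] = \frac{1}{1 - t^2},
\end{aligned}
\]

provided that $1 - t > 0$ and $1 + t > 0$; i.e., $-1 < t < 1$. However, the mgf of a
Laplace distribution with pdf (1.9.20) is

\[
\begin{aligned}
\int_{-\infty}^{\infty} e^{tx}\frac{e^{-|x|}}{2}\,dx &= \int_{-\infty}^{0} \frac{e^{(1 + t)x}}{2}\,dx + \int_0^{\infty} \frac{e^{(t - 1)x}}{2}\,dx \\
&= \frac{1}{2(1 + t)} + \frac{1}{2(1 - t)} = \frac{1}{1 - t^2},
\end{aligned}
\]

provided $-1 < t < 1$. Thus, by the uniqueness of mgfs, $Y$ has a Laplace distribution
with pdf (1.9.20). ■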
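The agreement of the two mgfs can be checked numerically at a few values of $t$ in $(-1, 1)$. The R sketch below is an illustration with names of our own; each exponent is written as a single expression so that the integrand does not overflow for large $|x|$.

```r
# mgf of Y = (X1 - X2)/2 as the product of two one-dimensional integrals.
y_mgf <- function(t) {
  a <- integrate(function(x1) 0.5 * exp(-x1 * (1 - t) / 2), 0, Inf)$value
  b <- integrate(function(x2) 0.5 * exp(-x2 * (1 + t) / 2), 0, Inf)$value
  a * b
}

# mgf of the Laplace pdf (1/2) exp(-|x|), split at zero as in the example.
laplace_mgf <- function(t) {
  integrate(function(x) 0.5 * exp(t * x - abs(x)), -Inf, 0)$value +
    integrate(function(x) 0.5 * exp(t * x - abs(x)), 0, Inf)$value
}

t <- c(-0.6, 0, 0.4)                # arbitrary points in (-1, 1)
m_y   <- sapply(t, y_mgf)
m_lap <- sapply(t, laplace_mgf)

cbind(t, m_y, m_lap, closed = 1 / (1 - t^2))
```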


EXERCISES 


2.2.1. If $p(x_1, x_2) = \left(\frac{2}{3}\right)^{x_1 + x_2}\left(\frac{1}{3}\right)^{2 - x_1 - x_2}$, $(x_1, x_2) = (0, 0), (0, 1), (1, 0), (1, 1)$, zero
elsewhere, is the joint pmf of $X_1$ and $X_2$, find the joint pmf of $Y_1 = X_1 - X_2$ and
$Y_2 = X_1 + X_2$.

2.2.2. Let $X_1$ and $X_2$ have the joint pmf $p(x_1, x_2) = x_1x_2/36$, $x_1 = 1, 2, 3$ and
$x_2 = 1, 2, 3$, zero elsewhere. Find first the joint pmf of $Y_1 = X_1X_2$ and $Y_2 = X_2$,
and then find the marginal pmf of $Y_1$.


2.2.3. Let $X_1$ and $X_2$ have the joint pdf $h(x_1, x_2) = 2e^{-x_1 - x_2}$, $0 < x_1 < x_2 < \infty$,
zero elsewhere. Find the joint pdf of $Y_1 = 2X_1$ and $Y_2 = X_2 - X_1$.


2.2.4. Let $X_1$ and $X_2$ have the joint pdf $h(x_1, x_2) = 8x_1x_2$, $0 < x_1 < x_2 < 1$, zero
elsewhere. Find the joint pdf of $Y_1 = X_1/X_2$ and $Y_2 = X_2$.

Hint: Use the inequalities $0 < y_1y_2 < y_2 < 1$ in considering the mapping from $\mathcal{S}$
onto $\mathcal{T}$.


2.2.5. Let $X_1$ and $X_2$ be continuous random variables with the joint probability
density function $f_{X_1,X_2}(x_1, x_2)$, $-\infty < x_i < \infty$, $i = 1, 2$. Let $Y_1 = X_1 + X_2$ and
$Y_2 = X_2$.


(a) Find the joint pdf $f_{Y_1,Y_2}$.
(b) Show that

\[
f_{Y_1}(y_1) = \int_{-\infty}^{\infty} f_{X_1,X_2}(y_1 - y_2, y_2)\,dy_2, \qquad (2.2.5)
\]

which is sometimes called the convolution formula.

2.2.6. Suppose $X_1$ and $X_2$ have the joint pdf $f_{X_1,X_2}(x_1, x_2) = e^{-(x_1 + x_2)}$, $0 < x_i <
\infty$, $i = 1, 2$, zero elsewhere.


(a) Use formula (2.2.5) to find the pdf of $Y_1 = X_1 + X_2$.
(b) Find the mgf of $Y_1$.


2.2.7. Use the formula (2.2.5) to find the pdf of $Y_1 = X_1 + X_2$, where $X_1$ and $X_2$
have the joint pdf $f_{X_1,X_2}(x_1, x_2) = 2e^{-(x_1 + x_2)}$, $0 < x_1 < x_2 < \infty$, zero elsewhere.


2.2.8. Suppose $X_1$ and $X_2$ have the joint pdf

\[
f(x_1, x_2) = \begin{cases} e^{-x_1}e^{-x_2} & x_1 > 0,\ x_2 > 0 \\ 0 & \text{elsewhere.} \end{cases}
\]

For constants $w_1 > 0$ and $w_2 > 0$, let $W = w_1X_1 + w_2X_2$.
(a) Show that the pdf of $W$ is

\[
f_W(w) = \begin{cases} \dfrac{1}{w_1 - w_2}\left(e^{-w/w_1} - e^{-w/w_2}\right) & w > 0 \\ 0 & \text{elsewhere.} \end{cases}
\]

(b) Verify that $f_W(w) > 0$ for $w > 0$.


(c) Note that the pdf $f_W(w)$ has an indeterminate form when $w_1 = w_2$. Rewrite
$f_W(w)$ using $h$ defined as $w_1 - w_2 = h$. Then use l'Hôpital's rule to show that
when $w_1 = w_2$, the pdf is given by $f_W(w) = (w/w_1^2)\exp\{-w/w_1\}$ for $w > 0$
and zero elsewhere.


2.3 Conditional Distributions and Expectations


In Section 2.1 we introduced the joint probability distribution of a pair of random 
variables. We also showed how to recover the individual (marginal) distributions 
for the random variables from the joint distribution. In this section, we discuss 
conditional distributions, i.e., the distribution of one of the random variables when 
the other has assumed a specific value. We discuss this first for the discrete case, 
which follows easily from the concept of conditional probability presented in Section 
1.4. 

Let $X_1$ and $X_2$ denote random variables of the discrete type, which have the joint
pmf $p_{X_1,X_2}(x_1, x_2)$ that is positive on the support set $\mathcal{S}$ and is zero elsewhere. Let
$p_{X_1}(x_1)$ and $p_{X_2}(x_2)$ denote, respectively, the marginal probability mass functions
of $X_1$ and $X_2$. Let $x_1$ be a point in the support of $X_1$; hence, $p_{X_1}(x_1) > 0$. Using
the definition of conditional probability, we have

\[
P(X_2 = x_2 \mid X_1 = x_1) = \frac{P(X_1 = x_1, X_2 = x_2)}{P(X_1 = x_1)} = \frac{p_{X_1,X_2}(x_1, x_2)}{p_{X_1}(x_1)}
\]

for all $x_2$ in the support $\mathcal{S}_{X_2}$ of $X_2$. Define this function as

\[
p_{X_2|X_1}(x_2|x_1) = \frac{p_{X_1,X_2}(x_1, x_2)}{p_{X_1}(x_1)}, \quad x_2 \in \mathcal{S}_{X_2}. \qquad (2.3.1)
\]
For any fixed $x_1$ with $p_{X_1}(x_1) > 0$, this function $p_{X_2|X_1}(x_2|x_1)$ satisfies the
conditions of being a pmf of the discrete type because $p_{X_2|X_1}(x_2|x_1)$ is nonnegative
and

\[
\sum_{x_2} p_{X_2|X_1}(x_2|x_1) = \sum_{x_2} \frac{p_{X_1,X_2}(x_1, x_2)}{p_{X_1}(x_1)} = \frac{1}{p_{X_1}(x_1)} \sum_{x_2} p_{X_1,X_2}(x_1, x_2) = \frac{p_{X_1}(x_1)}{p_{X_1}(x_1)} = 1.
\]

We call $p_{X_2|X_1}(x_2|x_1)$ the conditional pmf of the discrete type of random variable
$X_2$, given that the discrete type of random variable $X_1 = x_1$. In a similar manner,
provided $x_2 \in \mathcal{S}_{X_2}$, we define the symbol $p_{X_1|X_2}(x_1|x_2)$ by the relation

\[
p_{X_1|X_2}(x_1|x_2) = \frac{p_{X_1,X_2}(x_1, x_2)}{p_{X_2}(x_2)}, \quad x_1 \in \mathcal{S}_{X_1},
\]

and we call $p_{X_1|X_2}(x_1|x_2)$ the conditional pmf of the discrete type of random variable
$X_1$, given that the discrete type of random variable $X_2 = x_2$. We often abbreviate
$p_{X_1|X_2}(x_1|x_2)$ by $p_{1|2}(x_1|x_2)$ and $p_{X_2|X_1}(x_2|x_1)$ by $p_{2|1}(x_2|x_1)$. Similarly, $p_1(x_1)$
and $p_2(x_2)$ are used to denote the respective marginal pmfs.
Now let $X_1$ and $X_2$ denote random variables of the continuous type and have the
joint pdf $f_{X_1,X_2}(x_1, x_2)$ and the marginal probability density functions $f_{X_1}(x_1)$ and
$f_{X_2}(x_2)$, respectively. We use the results of the preceding paragraph to motivate
a definition of a conditional pdf of a continuous type of random variable. When
$f_{X_1}(x_1) > 0$, we define the symbol $f_{X_2|X_1}(x_2|x_1)$ by the relation

\[
f_{X_2|X_1}(x_2|x_1) = \frac{f_{X_1,X_2}(x_1, x_2)}{f_{X_1}(x_1)}. \qquad (2.3.2)
\]

In this relation, $x_1$ is to be thought of as having a fixed (but any fixed) value for
which $f_{X_1}(x_1) > 0$. It is evident that $f_{X_2|X_1}(x_2|x_1)$ is nonnegative and that

\[
\begin{aligned}
\int_{-\infty}^{\infty} f_{X_2|X_1}(x_2|x_1)\,dx_2 &= \int_{-\infty}^{\infty} \frac{f_{X_1,X_2}(x_1, x_2)}{f_{X_1}(x_1)}\,dx_2 \\
&= \frac{1}{f_{X_1}(x_1)} \int_{-\infty}^{\infty} f_{X_1,X_2}(x_1, x_2)\,dx_2 \\
&= \frac{1}{f_{X_1}(x_1)} f_{X_1}(x_1) = 1.
\end{aligned}
\]

That is, $f_{X_2|X_1}(x_2|x_1)$ has the properties of a pdf of one continuous type of random
variable. It is called the conditional pdf of the continuous type of random variable
$X_2$, given that the continuous type of random variable $X_1$ has the value $x_1$. When
$f_{X_2}(x_2) > 0$, the conditional pdf of the continuous random variable $X_1$, given that
the continuous type of random variable $X_2$ has the value $x_2$, is defined by

\[
f_{X_1|X_2}(x_1|x_2) = \frac{f_{X_1,X_2}(x_1, x_2)}{f_{X_2}(x_2)}, \quad f_{X_2}(x_2) > 0.
\]
We often abbreviate these conditional pdfs by $f_{1|2}(x_1|x_2)$ and $f_{2|1}(x_2|x_1)$,
respectively. Similarly, $f_1(x_1)$ and $f_2(x_2)$ are used to denote the respective marginal pdfs.

Since each of $f_{2|1}(x_2|x_1)$ and $f_{1|2}(x_1|x_2)$ is a pdf of one random variable, each
has all the properties of such a pdf. Thus we can compute probabilities and
mathematical expectations. If the random variables are of the continuous type, the
probability

\[
P(a < X_2 < b \mid X_1 = x_1) = \int_a^b f_{2|1}(x_2|x_1)\,dx_2
\]

is called “the conditional probability that $a < X_2 < b$, given that $X_1 = x_1$.” If there
is no ambiguity, this may be written in the form $P(a < X_2 < b \mid x_1)$. Similarly, the
conditional probability that $c < X_1 < d$, given $X_2 = x_2$, is

\[
P(c < X_1 < d \mid X_2 = x_2) = \int_c^d f_{1|2}(x_1|x_2)\,dx_1.
\]

If $u(X_2)$ is a function of $X_2$, the conditional expectation of $u(X_2)$, given that
$X_1 = x_1$, if it exists, is given by

\[
E[u(X_2)|x_1] = \int_{-\infty}^{\infty} u(x_2) f_{2|1}(x_2|x_1)\,dx_2.
\]

Note that $E[u(X_2)|x_1]$ is a function of $x_1$. If they do exist, then $E(X_2|x_1)$ is the
mean and $E\{[X_2 - E(X_2|x_1)]^2|x_1\}$ is the conditional variance of the conditional
distribution of $X_2$, given $X_1 = x_1$, which can be written more simply as $\text{Var}(X_2|x_1)$.
It is convenient to refer to these as the “conditional mean” and the “conditional
variance” of $X_2$, given $X_1 = x_1$. Of course, we have

\[
\text{Var}(X_2|x_1) = E(X_2^2|x_1) - [E(X_2|x_1)]^2
\]

from an earlier result. In a like manner, the conditional expectation of $u(X_1)$, given
$X_2 = x_2$, if it exists, is given by

\[
E[u(X_1)|x_2] = \int_{-\infty}^{\infty} u(x_1) f_{1|2}(x_1|x_2)\,dx_1.
\]

With random variables of the discrete type, these conditional probabilities and
conditional expectations are computed by using summation instead of integration.
An illustrative example follows.


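Example 2.3.1. Let $X_1$ and $X_2$ have the joint pdf

\[
f(x_1, x_2) = \begin{cases} 2 & 0 < x_1 < x_2 < 1 \\ 0 & \text{elsewhere.} \end{cases}
\]

Then the marginal probability density functions are, respectively,

\[
f_1(x_1) = \begin{cases} \int_{x_1}^1 2\,dx_2 = 2(1 - x_1) & 0 < x_1 < 1 \\ 0 & \text{elsewhere,} \end{cases}
\]

and

\[
f_2(x_2) = \begin{cases} \int_0^{x_2} 2\,dx_1 = 2x_2 & 0 < x_2 < 1 \\ 0 & \text{elsewhere.} \end{cases}
\]

The conditional pdf of $X_1$, given $X_2 = x_2$, $0 < x_2 < 1$, is

\[
f_{1|2}(x_1|x_2) = \begin{cases} \dfrac{2}{2x_2} = \dfrac{1}{x_2} & 0 < x_1 < x_2 < 1 \\ 0 & \text{elsewhere.} \end{cases}
\]

Here the conditional mean and the conditional variance of $X_1$, given $X_2 = x_2$, are
respectively,

\[
\begin{aligned}
E(X_1|x_2) &= \int_{-\infty}^{\infty} x_1 f_{1|2}(x_1|x_2)\,dx_1 \\
&= \int_0^{x_2} x_1\left(\frac{1}{x_2}\right) dx_1 \\
&= \frac{x_2}{2}, \quad 0 < x_2 < 1,
\end{aligned}
\]

and

\[
\begin{aligned}
\text{Var}(X_1|x_2) &= \int_0^{x_2} \left(x_1 - \frac{x_2}{2}\right)^2 \left(\frac{1}{x_2}\right) dx_1 \\
&= \frac{x_2^2}{12}, \quad 0 < x_2 < 1.
\end{aligned}
\]

Finally, we compare the values of

\[
P(0 < X_1 < \tfrac{1}{2} \mid X_2 = \tfrac{3}{4}) \quad \text{and} \quad P(0 < X_1 < \tfrac{1}{2}).
\]

We have

\[
P(0 < X_1 < \tfrac{1}{2} \mid X_2 = \tfrac{3}{4}) = \int_0^{1/2} f_{1|2}(x_1|\tfrac{3}{4})\,dx_1 = \int_0^{1/2} \left(\frac{4}{3}\right) dx_1 = \frac{2}{3},
\]

but

\[
P(0 < X_1 < \tfrac{1}{2}) = \int_0^{1/2} f_1(x_1)\,dx_1 = \int_0^{1/2} 2(1 - x_1)\,dx_1 = \frac{3}{4}. \quad ■
\]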
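The conditional mean and variance formulas of this example can be checked numerically by building the conditional pdf from the joint and marginal pdfs. In the R sketch below, the function names are ours and the conditioning value x2 = 0.6 is an arbitrary point in (0, 1) chosen only for illustration.

```r
x2 <- 0.6   # arbitrary fixed value of X2 in (0, 1)

# Joint pdf: 2 on 0 < x1 < x2 < 1, zero elsewhere.
f_joint <- function(x1, x2) 2 * (0 < x1 & x1 < x2 & x2 < 1)

# Marginal pdf of X2 and conditional pdf of X1 given X2 = x2.
f2 <- function(x2) integrate(function(x1) f_joint(x1, x2), 0, x2)$value
f_cond <- function(x1, x2) f_joint(x1, x2) / f2(x2)

# Conditional mean and variance by numerical integration over (0, x2).
cmean <- integrate(function(x1) x1 * f_cond(x1, x2), 0, x2)$value
cvar  <- integrate(function(x1) (x1 - cmean)^2 * f_cond(x1, x2), 0, x2)$value

# Unconditional probability P(0 < X1 < 1/2) from the marginal pdf of X1.
p_uncond <- integrate(function(x1) 2 * (1 - x1), 0, 0.5)$value

c(cmean, x2 / 2, cvar, x2^2 / 12, p_uncond)
```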


Since $E(X_2|x_1)$ is a function of $x_1$, then $E(X_2|X_1)$ is a random variable with its
own distribution, mean, and variance. Let us consider the following illustration of
this.


Example 2.3.2. Let X; and X2 have the joint pdf 


f( ve 6% O< a2 <4, <1 
ee oy elsewhere. 


Then the marginal pdf of Xj is 


ZY 
fi(a1) =} 6x2 dxg = Sai, 0<a< 1, 
0 
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zero elsewhere. The conditional pdf of X2, given X, = 71, is 


6 2 222 
faji(@2|21) =sa =a O<m<n, 
3x7 1 


zero elsewhere, where 0 < x; < 1. The conditional mean of X2, given X, = 71, is 


se 2 2 
E(X2|x1) =| v2 (=) dx = Zi, Ox ak 
0 ry 3 


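Now $E(X_2|X_1) = 2X_1/3$ is a random variable, say $Y$. The cdf of $Y = 2X_1/3$ is

$$G(y) = P(Y \le y) = P\left(X_1 \le \frac{3y}{2}\right), \quad 0 \le y < \frac23.$$

From the pdf $f_1(x_1)$, we have

$$G(y) = \int_0^{3y/2} 3x_1^2\,dx_1 = \frac{27y^3}{8}, \quad 0 \le y < \frac23.$$

Of course, $G(y) = 0$ if $y < 0$, and $G(y) = 1$ if $\frac23 < y$. The pdf, mean, and variance
of $Y = 2X_1/3$ are

$$g(y) = \frac{81y^2}{8}, \quad 0 \le y < \frac23,$$

zero elsewhere,

$$E(Y) = \int_0^{2/3} y \left(\frac{81y^2}{8}\right) dy = \frac12,$$

and

$$\operatorname{Var}(Y) = \int_0^{2/3} y^2 \left(\frac{81y^2}{8}\right) dy - \frac14 = \frac{1}{60}.$$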


Since the marginal pdf of $X_2$ is

$$f_2(x_2) = \int_{x_2}^{1} 6x_2\,dx_1 = 6x_2(1 - x_2), \quad 0 < x_2 < 1,$$

zero elsewhere, it is easy to show that $E(X_2) = \frac12$ and $\operatorname{Var}(X_2) = \frac{1}{20}$. That is, here
E(Y) = E[E(X2|X1)] = E(X2) 


and 
Var(Y) = Var[E(X2|X1)] < Var(X2). 
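The two comparisons at the end of Example 2.3.2 can be reproduced by numerical integration. This is a rough R sketch under the example's densities, $f_1(x_1) = 3x_1^2$ and $f_2(x_2) = 6x_2(1 - x_2)$ on $(0, 1)$; the helper `expect` is a hypothetical name introduced here for illustration.

```r
# Marginal pdfs from Example 2.3.2, both supported on (0, 1).
f1 <- function(x) 3 * x^2
f2 <- function(x) 6 * x * (1 - x)

# Hypothetical helper: E[g(X)] for a density 'dens' on (0, 1).
expect <- function(g, dens) integrate(function(x) g(x) * dens(x), 0, 1)$value

# Y = E(X2 | X1) = 2 * X1 / 3 is a function of X1, so use f1.
EY   <- expect(function(x) 2 * x / 3, f1)
VarY <- expect(function(x) (2 * x / 3)^2, f1) - EY^2

# Moments of X2 itself, using f2.
EX2   <- expect(function(x) x, f2)
VarX2 <- expect(function(x) x^2, f2) - EX2^2

c(EY = EY, EX2 = EX2, VarY = VarY, VarX2 = VarX2)
```

The means agree, while the variance of $Y$ is smaller than that of $X_2$.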


Example 2.3.2 is excellent, as it provides us with the opportunity to apply many 
of these new definitions as well as review the cdf technique for finding the distri- 
bution of a function of a random variable, namely Y = 2X1/3. Moreover, the two 
observations at the end of this example are no accident because they are true in 
general. 

Theorem 2.3.1. Let $(X_1, X_2)$ be a random vector such that the variance of $X_2$ is
finite. Then,


(a) $E[E(X_2|X_1)] = E(X_2)$.
(b) $\operatorname{Var}[E(X_2|X_1)] \le \operatorname{Var}(X_2)$.


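Proof: The proof is for the continuous case. To obtain it for the discrete case,
exchange summations for integrals. We first prove (a). Note that

$$\begin{aligned}
E(X_2) &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x_2 f(x_1, x_2)\,dx_2\,dx_1 \\
&= \int_{-\infty}^{\infty} \left[ \int_{-\infty}^{\infty} x_2 \frac{f(x_1, x_2)}{f_1(x_1)}\,dx_2 \right] f_1(x_1)\,dx_1 \\
&= \int_{-\infty}^{\infty} E(X_2|x_1) f_1(x_1)\,dx_1 \\
&= E[E(X_2|X_1)],
\end{aligned}$$

which is the first result.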
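Next we show (b). Consider, with $\mu_2 = E(X_2)$,

$$\begin{aligned}
\operatorname{Var}(X_2) &= E[(X_2 - \mu_2)^2] \\
&= E\{[X_2 - E(X_2|X_1) + E(X_2|X_1) - \mu_2]^2\} \\
&= E\{[X_2 - E(X_2|X_1)]^2\} + E\{[E(X_2|X_1) - \mu_2]^2\} \\
&\quad + 2E\{[X_2 - E(X_2|X_1)][E(X_2|X_1) - \mu_2]\}.
\end{aligned}$$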


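We show that the last term of the right-hand member of the immediately preceding
equation is zero. It is equal to

$$\begin{aligned}
&2 \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} [x_2 - E(X_2|x_1)][E(X_2|x_1) - \mu_2] f(x_1, x_2)\,dx_2\,dx_1 \\
&\quad = 2 \int_{-\infty}^{\infty} [E(X_2|x_1) - \mu_2] \left\{ \int_{-\infty}^{\infty} [x_2 - E(X_2|x_1)] \frac{f(x_1, x_2)}{f_1(x_1)}\,dx_2 \right\} f_1(x_1)\,dx_1.
\end{aligned}$$

But $E(X_2|x_1)$ is the conditional mean of $X_2$, given $X_1 = x_1$. Since the expression
in the inner braces is equal to

$$E(X_2|x_1) - E(X_2|x_1) = 0,$$

the double integral is equal to zero. Accordingly, we have

$$\operatorname{Var}(X_2) = E\{[X_2 - E(X_2|X_1)]^2\} + E\{[E(X_2|X_1) - \mu_2]^2\}.$$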


The first term in the right-hand member of this equation is nonnegative because it
is the expected value of a nonnegative function, namely $[X_2 - E(X_2|X_1)]^2$. Since
$E[E(X_2|X_1)] = \mu_2$, the second term is $\operatorname{Var}[E(X_2|X_1)]$. Hence we have

$$\operatorname{Var}(X_2) \ge \operatorname{Var}[E(X_2|X_1)],$$

which completes the proof. ■


Intuitively, this result could have this useful interpretation. Both the random
variables $X_2$ and $E(X_2|X_1)$ have the same mean $\mu_2$. If we did not know $\mu_2$, we
could use either of the two random variables to guess at the unknown $\mu_2$. Since,
however, $\operatorname{Var}(X_2) \ge \operatorname{Var}[E(X_2|X_1)]$, we would put more reliance in $E(X_2|X_1)$ as a
guess. That is, if we observe the pair $(X_1, X_2)$ to be $(x_1, x_2)$, we could prefer to use
$E(X_2|x_1)$ to $x_2$ as a guess at the unknown $\mu_2$. When studying the use of sufficient
statistics in estimation in Chapter 7, we make use of this famous result, attributed
to C. R. Rao and David Blackwell.

We finish this section with an example illustrating Theorem 2.3.1. 


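Example 2.3.3. Let $X_1$ and $X_2$ be discrete random variables. Suppose the conditional
pmf of $X_1$ given $X_2$ and the marginal distribution of $X_2$ are given by

$$\begin{aligned}
p(x_1|x_2) &= \binom{x_2}{x_1} \left(\frac12\right)^{x_2}, \quad x_1 = 0, 1, \ldots, x_2, \\
p(x_2) &= \frac23 \left(\frac13\right)^{x_2 - 1}, \quad x_2 = 1, 2, 3, \ldots.
\end{aligned}$$

Let us determine the mgf of $X_1$. For fixed $x_2$, by the binomial theorem,

$$E(e^{tX_1}|x_2) = \sum_{x_1 = 0}^{x_2} \binom{x_2}{x_1} e^{tx_1} \left(\frac12\right)^{x_2 - x_1} \left(\frac12\right)^{x_1} = \left(\frac12 + \frac12 e^t\right)^{x_2}.$$

Hence, by the geometric series and Theorem 2.3.1,

$$\begin{aligned}
E(e^{tX_1}) &= E[E(e^{tX_1}|X_2)] \\
&= \sum_{x_2 = 1}^{\infty} \left(\frac12 + \frac12 e^t\right)^{x_2} \frac23 \left(\frac13\right)^{x_2 - 1} \\
&= \frac23 \left(\frac12 + \frac12 e^t\right) \sum_{x_2 = 1}^{\infty} \left(\frac16 + \frac16 e^t\right)^{x_2 - 1} \\
&= \frac23 \left(\frac12 + \frac12 e^t\right) \frac{1}{1 - (1/6) - (1/6)e^t},
\end{aligned}$$

provided $(1/6) + (1/6)e^t < 1$ or $t < \log 5$ (which includes $t = 0$). ■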
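As a sanity check on Example 2.3.3, the closed-form mgf can be compared with a truncated version of the series $\sum_{x_2 \ge 1} E(e^{tX_1}|x_2)\,p(x_2)$. The R sketch below is illustrative only: it assumes the binomial($x_2$, 1/2) conditional pmf and the geometric marginal $p(x_2) = (2/3)(1/3)^{x_2 - 1}$ of the example, and the names `mgf_closed`, `mgf_series`, and the truncation point `n_terms` are introduced here.

```r
# Closed form of E(exp(t * X1)), valid for t < log(5).
mgf_closed <- function(t) (2 / 3) * (1 / 2 + exp(t) / 2) / (1 - 1 / 6 - exp(t) / 6)

# Truncated series: sum over x2 of E(exp(t * X1) | x2) * p(x2).
# n_terms is an arbitrary truncation point chosen for illustration.
mgf_series <- function(t, n_terms = 200) {
  x2 <- 1:n_terms
  sum((1 / 2 + exp(t) / 2)^x2 * (2 / 3) * (1 / 3)^(x2 - 1))
}

t_vals <- c(-1, 0, 0.5, 1)            # all below log(5), about 1.609
closed <- sapply(t_vals, mgf_closed)
series <- sapply(t_vals, mgf_series)
cbind(t = t_vals, closed = closed, series = series)
```

At $t = 0$ both columns equal 1, as any mgf must.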


EXERCISES 


2.3.1. Let $X_1$ and $X_2$ have the joint pdf $f(x_1, x_2) = x_1 + x_2$, $0 < x_1 < 1$,
$0 < x_2 < 1$, zero elsewhere. Find the conditional mean and variance of $X_2$, given
$X_1 = x_1$, $0 < x_1 < 1$.

2.3.2. Let $f_{1|2}(x_1|x_2) = c_1 x_1 / x_2^2$, $0 < x_1 < x_2$, $0 < x_2 < 1$, zero elsewhere, and
$f_2(x_2) = c_2 x_2^4$, $0 < x_2 < 1$, zero elsewhere, denote, respectively, the conditional pdf
of $X_1$, given $X_2 = x_2$, and the marginal pdf of $X_2$. Determine:


(a) The constants $c_1$ and $c_2$.
(b) The joint pdf of $X_1$ and $X_2$.
(c) P(q < X1 < $|X2 = 2). 
(d) P(Z < Xi <4). 


2.3.3. Let $f(x_1, x_2) = 21 x_1^2 x_2^3$, $0 < x_1 < x_2 < 1$, zero elsewhere, be the joint pdf
of $X_1$ and $X_2$.


(a) Find the conditional mean and variance of $X_1$, given $X_2 = x_2$, $0 < x_2 < 1$.
(b) Find the distribution of $Y = E(X_1|X_2)$.
(c) Determine $E(Y)$ and $\operatorname{Var}(Y)$ and compare these to $E(X_1)$ and $\operatorname{Var}(X_1)$,
respectively.


2.3.4. Suppose $X_1$ and $X_2$ are random variables of the discrete type that have
the joint pmf $p(x_1, x_2) = (x_1 + 2x_2)/18$, $(x_1, x_2) = (1,1), (1,2), (2,1), (2,2)$, zero
elsewhere. Determine the conditional mean and variance of $X_2$, given $X_1 = x_1$, for
$x_1 = 1$ or 2. Also, compute $E(3X_1 - 2X_2)$.


2.3.5. Let $X_1$ and $X_2$ be two random variables such that the conditional distributions
and means exist. Show that:


(a) $E(X_1 + X_2 \mid X_2) = E(X_1 \mid X_2) + X_2$,
(b) $E(u(X_2) \mid X_2) = u(X_2)$.


2.3.6. Let the joint pdf of $X$ and $Y$ be given by

$$f(x, y) = \begin{cases} \dfrac{2}{(1 + x + y)^3} & 0 < x < \infty,\ 0 < y < \infty \\ 0 & \text{elsewhere.} \end{cases}$$

(a) Compute the marginal pdf of $X$ and the conditional pdf of $Y$, given $X = x$.
(b) For a fixed $X = x$, compute $E(1 + x + Y \mid x)$ and use the result to compute
$E(Y \mid x)$.

2.3.7. Suppose $X_1$ and $X_2$ are discrete random variables which have the joint pmf
$p(x_1, x_2) = (3x_1 + x_2)/24$, $(x_1, x_2) = (1,1), (1,2), (2,1), (2,2)$, zero elsewhere. Find
the conditional mean $E(X_2|x_1)$, when $x_1 = 1$.


2.3.8. Let $X$ and $Y$ have the joint pdf $f(x, y) = 2\exp\{-(x + y)\}$, $0 < x < y < \infty$,
zero elsewhere. Find the conditional mean $E(Y|x)$ of $Y$, given $X = x$.


2.3.9. Five cards are drawn at random and without replacement from an ordinary
deck of cards. Let $X_1$ and $X_2$ denote, respectively, the number of spades and the
number of hearts that appear in the five cards.


(a) Determine the joint pmf of $X_1$ and $X_2$.
(b) Find the two marginal pmfs.
(c) What is the conditional pmf of $X_2$, given $X_1 = x_1$?


2.3.10. Let $X_1$ and $X_2$ have the joint pmf $p(x_1, x_2)$ described as follows:

    (x1, x2)     (0,0)   (0,1)   (1,0)   (1,1)   (2,0)   (2,1)
    p(x1, x2)    1/18    3/18    4/18    3/18    6/18    1/18

and $p(x_1, x_2)$ is equal to zero elsewhere. Find the two marginal probability mass
functions and the two conditional means.
Hint: Write the probabilities in a rectangular array.


2.3.11. Let us choose at random a point from the interval (0,1) and let the random
variable $X_1$ be equal to the number that corresponds to that point. Then choose
a point at random from the interval $(0, x_1)$, where $x_1$ is the experimental value of
$X_1$; and let the random variable $X_2$ be equal to the number that corresponds to
this point.


(a) Make assumptions about the marginal pdf $f_1(x_1)$ and the conditional pdf
$f_{2|1}(x_2|x_1)$.

(b) Compute $P(X_1 + X_2 \ge 1)$.

(c) Find the conditional mean $E(X_1|x_2)$.


2.3.12. Let $f(x)$ and $F(x)$ denote, respectively, the pdf and the cdf of the random
variable $X$. The conditional pdf of $X$, given $X > x_0$, $x_0$ a fixed number, is defined
by $f(x|X > x_0) = f(x)/[1 - F(x_0)]$, $x_0 < x$, zero elsewhere. This kind of conditional
pdf finds application in a problem of time until death, given survival until time $x_0$.


(a) Show that $f(x|X > x_0)$ is a pdf.


(b) Let $f(x) = e^{-x}$, $0 < x < \infty$, and zero elsewhere. Compute $P(X > 2|X > 1)$.


2.4 Independent Random Variables 


Let $X_1$ and $X_2$ denote the random variables of the continuous type that have the
joint pdf $f(x_1, x_2)$ and marginal probability density functions $f_1(x_1)$ and $f_2(x_2)$,
respectively. In accordance with the definition of the conditional pdf $f_{2|1}(x_2|x_1)$,
we may write the joint pdf $f(x_1, x_2)$ as

$$f(x_1, x_2) = f_{2|1}(x_2|x_1) f_1(x_1).$$

Suppose that we have an instance where $f_{2|1}(x_2|x_1)$ does not depend upon $x_1$. Then
the marginal pdf of $X_2$ is, for random variables of the continuous type,

$$\begin{aligned}
f_2(x_2) &= \int_{-\infty}^{\infty} f_{2|1}(x_2|x_1) f_1(x_1)\,dx_1 \\
&= f_{2|1}(x_2|x_1) \int_{-\infty}^{\infty} f_1(x_1)\,dx_1 \\
&= f_{2|1}(x_2|x_1).
\end{aligned}$$

Accordingly,

$$f_2(x_2) = f_{2|1}(x_2|x_1) \quad \text{and} \quad f(x_1, x_2) = f_1(x_1) f_2(x_2),$$

when $f_{2|1}(x_2|x_1)$ does not depend upon $x_1$. That is, if the conditional distribution
of $X_2$, given $X_1 = x_1$, is independent of any assumption about $x_1$, then $f(x_1, x_2) =
f_1(x_1) f_2(x_2)$.

The same discussion applies to the discrete case too, which we summarize in 
parentheses in the following definition. 


Definition 2.4.1 (Independence). Let the random variables $X_1$ and $X_2$ have the
joint pdf $f(x_1, x_2)$ [joint pmf $p(x_1, x_2)$] and the marginal pdfs [pmfs] $f_1(x_1)$ [$p_1(x_1)$]
and $f_2(x_2)$ [$p_2(x_2)$], respectively. The random variables $X_1$ and $X_2$ are said to be
independent if, and only if, $f(x_1, x_2) \equiv f_1(x_1) f_2(x_2)$ [$p(x_1, x_2) \equiv p_1(x_1) p_2(x_2)$].
Random variables that are not independent are said to be dependent.


Remark 2.4.1. Two comments should be made about the preceding definition.
First, the product of two positive functions $f_1(x_1) f_2(x_2)$ means a function that is
positive on the product space. That is, if $f_1(x_1)$ and $f_2(x_2)$ are positive on, and
only on, the respective spaces $S_1$ and $S_2$, then the product of $f_1(x_1)$ and $f_2(x_2)$
is positive on, and only on, the product space $S = \{(x_1, x_2) : x_1 \in S_1, x_2 \in S_2\}$.
For instance, if $S_1 = \{x_1 : 0 < x_1 < 1\}$ and $S_2 = \{x_2 : 0 < x_2 < 3\}$, then
$S = \{(x_1, x_2) : 0 < x_1 < 1,\ 0 < x_2 < 3\}$. The second remark pertains to the
identity. The identity in Definition 2.4.1 should be interpreted as follows. There
may be certain points $(x_1, x_2) \in S$ at which $f(x_1, x_2) \ne f_1(x_1) f_2(x_2)$. However, if $A$
is the set of points $(x_1, x_2)$ at which the equality does not hold, then $P(A) = 0$. In
subsequent theorems and the subsequent generalizations, a product of nonnegative
functions and an identity should be interpreted in an analogous manner.


Example 2.4.1. Suppose an urn contains 10 blue, 8 red, and 7 yellow balls that
are the same except for color. Suppose 4 balls are drawn without replacement. Let
$X$ and $Y$ be the number of blue and red balls drawn, respectively. The joint pmf
of $(X, Y)$ is

$$p(x, y) = \frac{\binom{10}{x} \binom{8}{y} \binom{7}{4 - x - y}}{\binom{25}{4}}, \quad 0 \le x, y \le 4;\ x + y \le 4.$$

Since $X + Y \le 4$, it would seem that $X$ and $Y$ are dependent. To see that this is
true by definition, we first find the marginal pmfs, which are:

$$p_X(x) = \frac{\binom{10}{x} \binom{15}{4 - x}}{\binom{25}{4}}, \quad 0 \le x \le 4;$$

$$p_Y(y) = \frac{\binom{8}{y} \binom{17}{4 - y}}{\binom{25}{4}}, \quad 0 \le y \le 4.$$

To show dependence, we need to find only one point in the support of $(X, Y)$ where
the joint pmf does not factor into the product of the marginal pmfs. Suppose we
select the point $x = 1$ and $y = 1$. Then, using R for calculation, we compute (to 4
places):

$$\begin{aligned}
p(1, 1) &= 10 \cdot 8 \cdot \binom{7}{2} \Big/ \binom{25}{4} = 0.1328 \\
p_X(1) &= 10 \binom{15}{3} \Big/ \binom{25}{4} = 0.3597 \\
p_Y(1) &= 8 \binom{17}{3} \Big/ \binom{25}{4} = 0.4300.
\end{aligned}$$

Since $0.1328 \ne 0.1547 = 0.3597 \cdot 0.4300$, $X$ and $Y$ are dependent random variables.
■
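The text says these values were computed in R but does not show the code. One way the calculation could be done is sketched below with base R's `choose`; the variable names are illustrative and not taken from the original.

```r
# Number of ways to draw 4 balls from the 25 in the urn.
n_total <- choose(25, 4)

# Joint pmf at (1, 1): one ball from the group of 10, one from the group of 8,
# and the remaining two from the 7 yellow balls.
p11 <- 10 * 8 * choose(7, 2) / n_total

# Marginal pmfs at 1: one ball from the group, three from all other balls.
pX1 <- 10 * choose(15, 3) / n_total
pY1 <- 8 * choose(17, 3) / n_total

round(c(joint = p11, pX = pX1, pY = pY1, product = pX1 * pY1), 4)
```

The joint value and the product of the marginals differ in the second decimal place, so the pmf does not factor at this point.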


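Example 2.4.2. Let the joint pdf of $X_1$ and $X_2$ be

$$f(x_1, x_2) = \begin{cases} x_1 + x_2 & 0 < x_1 < 1,\ 0 < x_2 < 1 \\ 0 & \text{elsewhere.} \end{cases}$$

We show that $X_1$ and $X_2$ are dependent. Here the marginal probability density
functions are

$$f_1(x_1) = \begin{cases} \int_{-\infty}^{\infty} f(x_1, x_2)\,dx_2 = \int_0^1 (x_1 + x_2)\,dx_2 = x_1 + \frac12 & 0 < x_1 < 1 \\ 0 & \text{elsewhere,} \end{cases}$$

and

$$f_2(x_2) = \begin{cases} \int_{-\infty}^{\infty} f(x_1, x_2)\,dx_1 = \int_0^1 (x_1 + x_2)\,dx_1 = \frac12 + x_2 & 0 < x_2 < 1 \\ 0 & \text{elsewhere.} \end{cases}$$

Since $f(x_1, x_2) \not\equiv f_1(x_1) f_2(x_2)$, the random variables $X_1$ and $X_2$ are dependent. ■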


The following theorem makes it possible to assert that the random variables $X_1$
and $X_2$ of Example 2.4.2 are dependent, without computing the marginal probability
density functions.


Theorem 2.4.1. Let the random variables $X_1$ and $X_2$ have supports $S_1$ and $S_2$,
respectively, and have the joint pdf $f(x_1, x_2)$. Then $X_1$ and $X_2$ are independent if
and only if $f(x_1, x_2)$ can be written as a product of a nonnegative function of $x_1$
and a nonnegative function of $x_2$. That is,

$$f(x_1, x_2) \equiv g(x_1) h(x_2),$$

where $g(x_1) > 0$, $x_1 \in S_1$, zero elsewhere, and $h(x_2) > 0$, $x_2 \in S_2$, zero elsewhere.


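Proof. If $X_1$ and $X_2$ are independent, then $f(x_1, x_2) \equiv f_1(x_1) f_2(x_2)$, where $f_1(x_1)$
and $f_2(x_2)$ are the marginal probability density functions of $X_1$ and $X_2$, respectively.
Thus the condition $f(x_1, x_2) \equiv g(x_1) h(x_2)$ is fulfilled.

Conversely, if $f(x_1, x_2) \equiv g(x_1) h(x_2)$, then, for random variables of the
continuous type, we have

$$f_1(x_1) = \int_{-\infty}^{\infty} g(x_1) h(x_2)\,dx_2 = g(x_1) \int_{-\infty}^{\infty} h(x_2)\,dx_2 = c_1 g(x_1)$$

and

$$f_2(x_2) = \int_{-\infty}^{\infty} g(x_1) h(x_2)\,dx_1 = h(x_2) \int_{-\infty}^{\infty} g(x_1)\,dx_1 = c_2 h(x_2),$$

where $c_1$ and $c_2$ are constants, not functions of $x_1$ or $x_2$. Moreover, $c_1 c_2 = 1$ because

$$1 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x_1) h(x_2)\,dx_1\,dx_2 = \left[ \int_{-\infty}^{\infty} g(x_1)\,dx_1 \right] \left[ \int_{-\infty}^{\infty} h(x_2)\,dx_2 \right] = c_2 c_1.$$

These results imply that

$$f(x_1, x_2) \equiv g(x_1) h(x_2) \equiv c_1 g(x_1) c_2 h(x_2) \equiv f_1(x_1) f_2(x_2).$$

Accordingly, $X_1$ and $X_2$ are independent. ■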


This theorem is true for the discrete case also. Simply replace the joint pdf by 
the joint pmf. For instance, the discrete random variables X and Y of Example 
2.4.1 are immediately seen to be dependent because the support of (X,Y) is not a 
product space. 

Next, consider the joint distribution of the continuous random vector $(X, Y)$
given in Example 2.1.3. The joint pdf is

$$f(x, y) = 4x e^{-x^2} y e^{-y^2}, \quad x > 0,\ y > 0,$$

which is a product of a nonnegative function of $x$ and a nonnegative function of $y$.
Further, the joint support is a product space. Hence, $X$ and $Y$ are independent
random variables.


Example 2.4.3. Let the pdf of the random variables $X_1$ and $X_2$ be $f(x_1, x_2) =
8 x_1 x_2$, $0 < x_1 < x_2 < 1$, zero elsewhere. The formula $8 x_1 x_2$ might suggest to some
that $X_1$ and $X_2$ are independent. However, if we consider the space $S = \{(x_1, x_2) :
0 < x_1 < x_2 < 1\}$, we see that it is not a product space. This should make it clear
that, in general, $X_1$ and $X_2$ must be dependent if the space of positive probability
density of $X_1$ and $X_2$ is bounded by a curve that is neither a horizontal nor a
vertical line. ■

Instead of working with pdfs (or pmfs) we could have presented independence
in terms of cumulative distribution functions. The following theorem shows the
equivalence.


Theorem 2.4.2. Let $(X_1, X_2)$ have the joint cdf $F(x_1, x_2)$ and let $X_1$ and $X_2$ have
the marginal cdfs $F_1(x_1)$ and $F_2(x_2)$, respectively. Then $X_1$ and $X_2$ are independent
if and only if

$$F(x_1, x_2) = F_1(x_1) F_2(x_2) \quad \text{for all } (x_1, x_2) \in R^2. \tag{2.4.1}$$


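Proof: We give the proof for the continuous case. Suppose expression (2.4.1) holds.
Then the mixed second partial is

$$\frac{\partial^2}{\partial x_1 \partial x_2} F(x_1, x_2) = f_1(x_1) f_2(x_2).$$

Hence, $X_1$ and $X_2$ are independent. Conversely, suppose $X_1$ and $X_2$ are independent.
Then by the definition of the joint cdf,

$$\begin{aligned}
F(x_1, x_2) &= \int_{-\infty}^{x_1} \int_{-\infty}^{x_2} f_1(w_1) f_2(w_2)\,dw_2\,dw_1 \\
&= \int_{-\infty}^{x_1} f_1(w_1)\,dw_1 \cdot \int_{-\infty}^{x_2} f_2(w_2)\,dw_2 = F_1(x_1) F_2(x_2).
\end{aligned}$$

Hence, condition (2.4.1) is true. ■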


We now give a theorem that frequently simplifies the calculations of probabilities
of events that involve independent variables.


Theorem 2.4.3. The random variables $X_1$ and $X_2$ are independent random variables
if and only if the following condition holds,

$$P(a < X_1 \le b,\ c < X_2 \le d) = P(a < X_1 \le b) P(c < X_2 \le d) \tag{2.4.2}$$

for every $a < b$ and $c < d$, where $a$, $b$, $c$, and $d$ are constants.


Proof: If $X_1$ and $X_2$ are independent, then an application of the last theorem and
expression (2.1.2) shows that

$$\begin{aligned}
P(a < X_1 \le b,\ c < X_2 \le d) &= F(b, d) - F(a, d) - F(b, c) + F(a, c) \\
&= F_1(b) F_2(d) - F_1(a) F_2(d) - F_1(b) F_2(c) + F_1(a) F_2(c) \\
&= [F_1(b) - F_1(a)][F_2(d) - F_2(c)],
\end{aligned}$$

which is the right side of expression (2.4.2). Conversely, condition (2.4.2) implies
that the joint cdf of $(X_1, X_2)$ factors into a product of the marginal cdfs, which in
turn by Theorem 2.4.2 implies that $X_1$ and $X_2$ are independent. ■

Example 2.4.4 (Example 2.4.2, Continued). Independence is necessary for condition
(2.4.2). For example, consider the dependent variables $X_1$ and $X_2$ of Example
2.4.2. For these random variables, we have

$$P(0 < X_1 < \tfrac12,\ 0 < X_2 < \tfrac12) = \int_0^{1/2} \int_0^{1/2} (x_1 + x_2)\,dx_1\,dx_2 = \tfrac18,$$

whereas

$$P(0 < X_1 < \tfrac12) = \int_0^{1/2} (x_1 + \tfrac12)\,dx_1 = \tfrac38$$

and

$$P(0 < X_2 < \tfrac12) = \int_0^{1/2} (\tfrac12 + x_2)\,dx_2 = \tfrac38.$$

Hence, condition (2.4.2) does not hold. ■
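The failure of condition (2.4.2) in Example 2.4.4 can also be seen numerically. The R sketch below assumes the joint pdf $f(x_1, x_2) = x_1 + x_2$ on the unit square from Example 2.4.2 and evaluates the double integral as an iterated one-dimensional integral; the helper `inner` is a hypothetical name.

```r
# Joint pdf of Example 2.4.2 on the unit square.
f <- function(x1, x2) x1 + x2

# Inner integral over x1 in (0, 1/2), vectorized in x2 so that
# integrate() can use it as an integrand.
inner <- function(x2) {
  sapply(x2, function(b) integrate(function(x1) f(x1, b), 0, 0.5)$value)
}

p_joint <- integrate(inner, 0, 0.5)$value                     # P(0<X1<1/2, 0<X2<1/2)
p1 <- integrate(function(x1) x1 + 0.5, 0, 0.5)$value          # P(0<X1<1/2)
p2 <- integrate(function(x2) 0.5 + x2, 0, 0.5)$value          # P(0<X2<1/2)

c(joint = p_joint, product = p1 * p2)
```

The joint probability is not equal to the product of the two marginal probabilities, consistent with dependence.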


Not merely are calculations of some probabilities usually simpler when we have 
independent random variables, but many expectations, including certain moment 
generating functions, have comparably simpler computations. The following result 
proves so useful that we state it in the form of a theorem. 


Theorem 2.4.4. Suppose $X_1$ and $X_2$ are independent and that $E(u(X_1))$ and
$E(v(X_2))$ exist. Then

$$E[u(X_1) v(X_2)] = E[u(X_1)] E[v(X_2)].$$


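Proof. We give the proof in the continuous case. The independence of $X_1$ and $X_2$
implies that the joint pdf of $X_1$ and $X_2$ is $f_1(x_1) f_2(x_2)$. Thus we have, by definition
of expectation,

$$\begin{aligned}
E[u(X_1) v(X_2)] &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} u(x_1) v(x_2) f_1(x_1) f_2(x_2)\,dx_1\,dx_2 \\
&= \left[ \int_{-\infty}^{\infty} u(x_1) f_1(x_1)\,dx_1 \right] \left[ \int_{-\infty}^{\infty} v(x_2) f_2(x_2)\,dx_2 \right] \\
&= E[u(X_1)] E[v(X_2)].
\end{aligned}$$

Hence, the result is true. ■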


Upon taking the functions $u(\cdot)$ and $v(\cdot)$ to be the identity functions in Theorem
2.4.4, we have that for independent random variables $X_1$ and $X_2$,

$$E(X_1 X_2) = E(X_1) E(X_2). \tag{2.4.3}$$


We next prove a very useful theorem about independent random variables. The 
proof of the theorem relies heavily upon our assertion that an mgf, when it exists, 
is unique and that it uniquely determines the distribution of probability. 


Theorem 2.4.5. Suppose the joint mgf, $M(t_1, t_2)$, exists for the random variables
$X_1$ and $X_2$. Then $X_1$ and $X_2$ are independent if and only if

$$M(t_1, t_2) = M(t_1, 0) M(0, t_2);$$

that is, the joint mgf is identically equal to the product of the marginal mgfs.

Proof. If $X_1$ and $X_2$ are independent, then

$$\begin{aligned}
M(t_1, t_2) &= E(e^{t_1 X_1 + t_2 X_2}) \\
&= E(e^{t_1 X_1} e^{t_2 X_2}) \\
&= E(e^{t_1 X_1}) E(e^{t_2 X_2}) \\
&= M(t_1, 0) M(0, t_2).
\end{aligned}$$

Thus the independence of $X_1$ and $X_2$ implies that the mgf of the joint distribution
factors into the product of the moment-generating functions of the two marginal
distributions.

Suppose next that the mgf of the joint distribution of $X_1$ and $X_2$ is given by
$M(t_1, t_2) = M(t_1, 0) M(0, t_2)$. Now $X_1$ has the unique mgf, which, in the continuous
case, is given by

$$M(t_1, 0) = \int_{-\infty}^{\infty} e^{t_1 x_1} f_1(x_1)\,dx_1.$$

Similarly, the unique mgf of $X_2$, in the continuous case, is given by

$$M(0, t_2) = \int_{-\infty}^{\infty} e^{t_2 x_2} f_2(x_2)\,dx_2.$$

Thus we have

$$\begin{aligned}
M(t_1, 0) M(0, t_2) &= \left[ \int_{-\infty}^{\infty} e^{t_1 x_1} f_1(x_1)\,dx_1 \right] \left[ \int_{-\infty}^{\infty} e^{t_2 x_2} f_2(x_2)\,dx_2 \right] \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{t_1 x_1 + t_2 x_2} f_1(x_1) f_2(x_2)\,dx_1\,dx_2.
\end{aligned}$$

We are given that $M(t_1, t_2) = M(t_1, 0) M(0, t_2)$; so

$$M(t_1, t_2) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{t_1 x_1 + t_2 x_2} f_1(x_1) f_2(x_2)\,dx_1\,dx_2.$$

But $M(t_1, t_2)$ is the mgf of $X_1$ and $X_2$. Thus

$$M(t_1, t_2) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{t_1 x_1 + t_2 x_2} f(x_1, x_2)\,dx_1\,dx_2.$$

The uniqueness of the mgf implies that the two distributions of probability that are
described by $f_1(x_1) f_2(x_2)$ and $f(x_1, x_2)$ are the same. Thus

$$f(x_1, x_2) \equiv f_1(x_1) f_2(x_2).$$

That is, if $M(t_1, t_2) = M(t_1, 0) M(0, t_2)$, then $X_1$ and $X_2$ are independent. This
completes the proof when the random variables are of the continuous type. With
random variables of the discrete type, the proof is made by using summation instead
of integration.

Example 2.4.5 (Example 2.1.10, Continued). Let $(X, Y)$ be a pair of random
variables with the joint pdf

$$f(x, y) = \begin{cases} e^{-y} & 0 < x < y < \infty \\ 0 & \text{elsewhere.} \end{cases}$$

In Example 2.1.10, we showed that the mgf of $(X, Y)$ is

$$\begin{aligned}
M(t_1, t_2) &= \int_0^{\infty} \int_x^{\infty} \exp(t_1 x + t_2 y - y)\,dy\,dx \\
&= \frac{1}{(1 - t_1 - t_2)(1 - t_2)},
\end{aligned}$$

provided that $t_1 + t_2 < 1$ and $t_2 < 1$. Because $M(t_1, t_2) \ne M(t_1, 0) M(0, t_2)$, the
random variables are dependent. ■
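To see concretely that the mgf of Example 2.4.5 does not factor, one can evaluate both sides at a single point. The R sketch below uses the closed form $M(t_1, t_2) = 1/[(1 - t_1 - t_2)(1 - t_2)]$ stated in the example; the evaluation point $t_1 = t_2 = 0.25$ is an arbitrary choice satisfying $t_1 + t_2 < 1$ and $t_2 < 1$.

```r
# Joint mgf from Example 2.4.5, valid when t1 + t2 < 1 and t2 < 1.
M <- function(t1, t2) 1 / ((1 - t1 - t2) * (1 - t2))

# Arbitrary evaluation point inside the region where the mgf exists.
t1 <- 0.25
t2 <- 0.25

joint     <- M(t1, t2)                # M(t1, t2)
prod_marg <- M(t1, 0) * M(0, t2)      # M(t1, 0) * M(0, t2)

c(joint = joint, product_of_marginals = prod_marg)
```

A single point of disagreement is enough to rule out the factorization required by Theorem 2.4.5.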


Example 2.4.6 (Exercise 2.1.15, Continued). For the random variables $X_1$ and $X_2$
defined in Exercise 2.1.15, we showed that the joint mgf is

$$M(t_1, t_2) = \left[ \frac{\exp\{t_1\}}{2 - \exp\{t_1\}} \right] \left[ \frac{\exp\{t_2\}}{2 - \exp\{t_2\}} \right], \quad t_i < \log 2,\ i = 1, 2.$$

We showed further that $M(t_1, t_2) = M(t_1, 0) M(0, t_2)$. Hence, $X_1$ and $X_2$ are
independent random variables. ■


EXERCISES

2.4.1. Show that the random variables $X_1$ and $X_2$ with joint pdf

$$f(x_1, x_2) = \begin{cases} 12 x_1 x_2 (1 - x_2) & 0 < x_1 < 1,\ 0 < x_2 < 1 \\ 0 & \text{elsewhere} \end{cases}$$

are independent.


2.4.2. If the random variables $X_1$ and $X_2$ have the joint pdf $f(x_1, x_2) = 2e^{-x_1 - x_2}$,
$0 < x_1 < x_2$, $0 < x_2 < \infty$, zero elsewhere, show that $X_1$ and $X_2$ are dependent.


2.4.3. Let $p(x_1, x_2) = \frac{1}{16}$, $x_1 = 1, 2, 3, 4$, and $x_2 = 1, 2, 3, 4$, zero elsewhere, be the
joint pmf of $X_1$ and $X_2$. Show that $X_1$ and $X_2$ are independent.


2.4.4, Find P(0 < X) < $,0 < X2 < $) if the random variables X; and X2 have 
the joint pdf $f(x_1, x_2) = 4x_1(1 - x_2)$, $0 < x_1 < 1$, $0 < x_2 < 1$, zero elsewhere.

2.4.5. Find the probability of the union of the events $a < X_1 < b$, $-\infty < X_2 < \infty$,
and $-\infty < X_1 < \infty$, $c < X_2 < d$ if $X_1$ and $X_2$ are two independent variables with
P(a < X, <b) = $ and P(c < Xp <d) = 2. 

2.4.6. If $f(x_1, x_2) = e^{-x_1 - x_2}$, $0 < x_1 < \infty$, $0 < x_2 < \infty$, zero elsewhere, is the
joint pdf of the random variables $X_1$ and $X_2$, show that $X_1$ and $X_2$ are independent
and that $M(t_1, t_2) = (1 - t_1)^{-1}(1 - t_2)^{-1}$, $t_2 < 1$, $t_1 < 1$. Also show that

$$E(e^{t(X_1 + X_2)}) = (1 - t)^{-2}, \quad t < 1.$$

Accordingly, find the mean and the variance of $Y = X_1 + X_2$.


2.4.7. Let the random variables $X_1$ and $X_2$ have the joint pdf $f(x_1, x_2) = 1/\pi$, for
$(x_1 - 1)^2 + (x_2 + 2)^2 < 1$, zero elsewhere. Find $f_1(x_1)$ and $f_2(x_2)$. Are $X_1$ and $X_2$
independent?


2.4.8. Let $X$ and $Y$ have the joint pdf $f(x, y) = 3x$, $0 < y < x < 1$, zero elsewhere.
Are $X$ and $Y$ independent? If not, find $E(X|y)$.


2.4.9. Suppose that a man leaves for work between 8:00 a.m. and 8:30 a.m. and 
takes between 40 and 50 minutes to get to the office. Let X denote the time of 
departure and let Y denote the time of travel. If we assume that these random 
variables are independent and uniformly distributed, find the probability that he 
arrives at the office before 9:00 a.m. 


2.4.10. Let $X$ and $Y$ be random variables with the space consisting of the four
points $(0,0)$, $(1,1)$, $(1,0)$, $(1,-1)$. Assign positive probabilities to these four points
so that the correlation coefficient is equal to zero. Are $X$ and $Y$ independent?


2.4.11. Two line segments, each of length two units, are placed along the $x$-axis.
The midpoint of the first is between $x = 0$ and $x = 14$ and that of the second is
between $x = 6$ and $x = 20$. Assuming independence and uniform distributions for
these midpoints, find the probability that the line segments overlap.


2.4.12. Cast a fair die and let X = 0 if 1,2, or 3 spots appear, let X = 1 if 4 or 5 
spots appear, and let X = 2 if 6 spots appear. Do this two independent times, 
obtaining X; and X2. Calculate P(|X1 — X2| = 1). 


2.4.13. For $X_1$ and $X_2$ in Example 2.4.6, show that the mgf of $Y = X_1 + X_2$ is
$e^{2t}/(2 - e^t)^2$, $t < \log 2$, and then compute the mean and variance of $Y$.


2.5 The Correlation Coefficient 


Let $(X, Y)$ denote a random vector. In the last section, we discussed the concept
of independence between $X$ and $Y$. What if, though, $X$ and $Y$ are dependent
and, if so, how are they related? There are many measures of dependence. In
this section, we introduce a parameter $\rho$ of the joint distribution of $(X, Y)$ which
measures linearity between $X$ and $Y$. In this section, we assume the existence of
all expectations under discussion.


Definition 2.5.1. Let $(X, Y)$ have a joint distribution. Denote the means of $X$
and $Y$ respectively by $\mu_1$ and $\mu_2$ and their respective variances by $\sigma_1^2$ and $\sigma_2^2$. The
covariance of $(X, Y)$ is denoted by $\operatorname{cov}(X, Y)$ and is defined by the expectation

$$\operatorname{cov}(X, Y) = E[(X - \mu_1)(Y - \mu_2)]. \tag{2.5.1}$$

It follows by the linearity of expectation, Theorem 2.1.1, that the covariance of
$X$ and $Y$ can also be expressed as

$$\begin{aligned}
\operatorname{cov}(X, Y) &= E(XY - \mu_2 X - \mu_1 Y + \mu_1 \mu_2) \\
&= E(XY) - \mu_2 E(X) - \mu_1 E(Y) + \mu_1 \mu_2 \\
&= E(XY) - \mu_1 \mu_2, \tag{2.5.2}
\end{aligned}$$

which is often easier to compute than using the definition, (2.5.1).
The measure that we seek is a standardized (unitless) version of the covariance. 


Definition 2.5.2. If each of $\sigma_1$ and $\sigma_2$ is positive, then the correlation coefficient
between $X$ and $Y$ is defined by

$$\rho = \frac{E[(X - \mu_1)(Y - \mu_2)]}{\sigma_1 \sigma_2} = \frac{\operatorname{cov}(X, Y)}{\sigma_1 \sigma_2}. \tag{2.5.3}$$

It should be noted that the expected value of the product of two random variables
is equal to the product of their expectations plus their covariance; that is, $E(XY) =
\mu_1 \mu_2 + \operatorname{cov}(X, Y) = \mu_1 \mu_2 + \rho \sigma_1 \sigma_2$.

As illustrations, we present two examples. The first is for a discrete model while
the second concerns a continuous model.


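Example 2.5.1. Reconsider the random vector $(X_1, X_2)$ of Example 2.1.1 where a
fair coin is flipped three times and $X_1$ is the number of heads on the first two flips
while $X_2$ is the number of heads on all three flips. Recall that Table 2.1.1 contains
the marginal distributions of $X_1$ and $X_2$. By symmetry of these pmfs, we have
$E(X_1) = 1$ and $E(X_2) = 3/2$. To compute the correlation coefficient of $(X_1, X_2)$,
we next sketch the computation of the required moments:

$$\begin{aligned}
E(X_1^2) &= \frac12 + 2^2 \cdot \frac14 = \frac32 \ \Rightarrow\ \sigma_1^2 = \frac32 - 1^2 = \frac12; \\
E(X_2^2) &= \frac38 + 4 \cdot \frac38 + 9 \cdot \frac18 = 3 \ \Rightarrow\ \sigma_2^2 = 3 - \left(\frac32\right)^2 = \frac34; \\
E(X_1 X_2) &= \frac28 + 1 \cdot 2 \cdot \frac28 + 2 \cdot 2 \cdot \frac18 + 2 \cdot 3 \cdot \frac18 = 2 \ \Rightarrow\ \operatorname{cov}(X_1, X_2) = 2 - 1 \cdot \frac32 = \frac12.
\end{aligned}$$

From which it follows that $\rho = (1/2)/\left(\sqrt{1/2}\,\sqrt{3/4}\right) = 0.816$. ■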
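The moments in Example 2.5.1 can be recomputed by enumerating the eight equally likely outcomes of three fair coin flips. The R sketch below does this directly; the data frame `flips` and its column names are illustrative and not from the original text.

```r
# All 2^3 outcomes of three flips (1 = head, 0 = tail), each with probability 1/8.
flips <- expand.grid(f1 = 0:1, f2 = 0:1, f3 = 0:1)

x1 <- flips$f1 + flips$f2      # heads on the first two flips
x2 <- x1 + flips$f3            # heads on all three flips

# With equally likely outcomes, expectations are plain averages.
mu1 <- mean(x1)
mu2 <- mean(x2)
var1 <- mean(x1^2) - mu1^2
var2 <- mean(x2^2) - mu2^2
cov12 <- mean(x1 * x2) - mu1 * mu2

rho <- cov12 / sqrt(var1 * var2)
round(c(mu1 = mu1, mu2 = mu2, var1 = var1, var2 = var2, cov = cov12, rho = rho), 3)
```

Note that `mean(x^2) - mean(x)^2` is used rather than `var()`, because `var()` applies the sample divisor $n - 1$ and would not give the population variance.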


Example 2.5.2. Let the random variables $X$ and $Y$ have the joint pdf

$$f(x, y) = \begin{cases} x + y & 0 < x < 1,\ 0 < y < 1 \\ 0 & \text{elsewhere.} \end{cases}$$

We next compute the correlation coefficient $\rho$ of $X$ and $Y$. Now

$$\mu_1 = E(X) = \int_0^1 \int_0^1 x(x + y)\,dx\,dy = \frac{7}{12}$$

and

$$\sigma_1^2 = E(X^2) - \mu_1^2 = \int_0^1 \int_0^1 x^2 (x + y)\,dx\,dy - \left(\frac{7}{12}\right)^2 = \frac{11}{144}.$$

Similarly,

$$\mu_2 = E(Y) = \frac{7}{12} \quad \text{and} \quad \sigma_2^2 = E(Y^2) - \mu_2^2 = \frac{11}{144}.$$

The covariance of $X$ and $Y$ is

$$E(XY) - \mu_1 \mu_2 = \int_0^1 \int_0^1 xy(x + y)\,dx\,dy - \left(\frac{7}{12}\right)^2 = -\frac{1}{144}.$$

Accordingly, the correlation coefficient of $X$ and $Y$ is

$$\rho = \frac{-\frac{1}{144}}{\sqrt{\left(\frac{11}{144}\right)\left(\frac{11}{144}\right)}} = -\frac{1}{11}. \quad \blacksquare$$

We next establish that, in general, $|\rho| \le 1$.


Theorem 2.5.1. For all jointly distributed random variables $(X, Y)$ whose correlation
coefficient $\rho$ exists, $-1 \le \rho \le 1$.


Proof: Consider the polynomial in $v$ given by

$$h(v) = E\left\{[(X - \mu_1) + v(Y - \mu_2)]^2\right\}.$$

Then $h(v) \ge 0$, for all $v$. Hence, the discriminant of $h(v)$ is less than or equal to 0.
To obtain the discriminant, we expand $h(v)$ as

$$h(v) = \sigma_1^2 + 2v\rho\sigma_1\sigma_2 + v^2\sigma_2^2.$$

Hence, the discriminant of $h(v)$ is $4\rho^2\sigma_1^2\sigma_2^2 - 4\sigma_2^2\sigma_1^2$. Since this is less than or equal
to 0, we have

$$4\rho^2\sigma_1^2\sigma_2^2 \le 4\sigma_2^2\sigma_1^2 \quad \text{or} \quad \rho^2 \le 1,$$

which is the result sought. ■


Theorem 2.5.2. If $X$ and $Y$ are independent random variables then $\operatorname{cov}(X, Y) = 0$
and, hence, $\rho = 0$.


Proof: Because $X$ and $Y$ are independent, it follows from expression (2.4.3) that
$E(XY) = E(X)E(Y)$. Hence, by (2.5.2) the covariance of $X$ and $Y$ is 0; i.e., $\rho = 0$.
■

As the following example shows, the converse of this theorem is not true: 


Example 2.5.3. Let $X$ and $Y$ be jointly discrete random variables whose distribution
has mass 1/4 at each of the four points $(-1, 0)$, $(0, -1)$, $(1, 0)$, and $(0, 1)$. It
follows that both $X$ and $Y$ have the same marginal distribution with range $\{-1, 0, 1\}$
and respective probabilities 1/4, 1/2, and 1/4. Hence, $\mu_1 = \mu_2 = 0$ and a quick
calculation shows that $E(XY) = 0$. Thus, $\rho = 0$. However, $P(X = 0, Y = 0) = 0$
while $P(X = 0)P(Y = 0) = (1/2)(1/2) = 1/4$. Thus, $X$ and $Y$ are dependent but
the correlation coefficient of $X$ and $Y$ is 0. ■
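Example 2.5.3 is small enough to verify by direct enumeration. The R sketch below lists the four support points with mass 1/4 each and checks both the zero covariance and the failure of the product rule at $(0, 0)$; the variable names are illustrative.

```r
# Support points of (X, Y) and their probabilities.
x <- c(-1, 0, 1, 0)
y <- c(0, -1, 0, 1)
p <- rep(1 / 4, 4)

# Covariance via E(XY) - E(X)E(Y).
cov_xy <- sum(x * y * p) - sum(x * p) * sum(y * p)

# Joint probability at (0, 0) versus the product of the marginals.
p00 <- sum(p[x == 0 & y == 0])
pX0 <- sum(p[x == 0])
pY0 <- sum(p[y == 0])

c(cov = cov_xy, joint_00 = p00, product_00 = pX0 * pY0)
```

Zero covariance together with a joint probability that does not factor shows that uncorrelated variables need not be independent.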

Although the converse of Theorem 2.5.2 is not true, the contrapositive is; i.e.,
if $\rho \ne 0$ then $X$ and $Y$ are dependent. For instance, in Example 2.5.1, since
$\rho = 0.816$, we know that the random variables $X_1$ and $X_2$ discussed in this example
are dependent. As discussed in Section 10.8, this contrapositive is often used in
Statistics.

Exercise 2.5.7 points out that in the proof of Theorem 2.5.1, the discriminant
of the polynomial $h(v)$ is 0 if and only if $\rho = \pm 1$. In that case $X$ and $Y$ are linear
functions of one another with probability one; although, as shown, the relationship is
degenerate. This suggests the following interesting question: When $\rho$ does not have
one of its extreme values, is there a line in the $xy$-plane such that the probability
for $X$ and $Y$ tends to be concentrated in a band about this line? Under certain
restrictive conditions this is, in fact, the case, and under those conditions we can
look upon $\rho$ as a measure of the intensity of the concentration of the probability for
$X$ and $Y$ about that line.

We summarize these thoughts in the next theorem. For notation, let $f(x, y)$
denote the joint pdf of two random variables $X$ and $Y$ and let $f_1(x)$ denote the
marginal pdf of $X$. Recall from Section 2.3 that the conditional pdf of $Y$, given
$X = x$, is

$$f_{2|1}(y|x) = \frac{f(x, y)}{f_1(x)}$$

at points where $f_1(x) > 0$, and the conditional mean of $Y$, given $X = x$, is given by

$$E(Y|x) = \int_{-\infty}^{\infty} y f_{2|1}(y|x)\,dy = \frac{\int_{-\infty}^{\infty} y f(x, y)\,dy}{f_1(x)},$$

when dealing with random variables of the continuous type. This conditional mean
of $Y$, given $X = x$, is, of course, a function of $x$, say $u(x)$. In a like vein, the
conditional mean of $X$, given $Y = y$, is a function of $y$, say $v(y)$.

In case $u(x)$ is a linear function of $x$, say $u(x) = a + bx$, we say the conditional
mean of $Y$ is linear in $x$; or that $Y$ has a linear conditional mean. When $u(x) =
a + bx$, the constants $a$ and $b$ have simple values which we show in the following
theorem.


Theorem 2.5.3. Suppose $(X, Y)$ have a joint distribution with the variances of $X$
and $Y$ finite and positive. Denote the means and variances of $X$ and $Y$ by $\mu_1, \mu_2$
and $\sigma_1^2, \sigma_2^2$, respectively, and let $\rho$ be the correlation coefficient between $X$ and $Y$. If
$E(Y|X)$ is linear in $X$ then

$$E(Y|X) = \mu_2 + \rho \frac{\sigma_2}{\sigma_1}(X - \mu_1) \tag{2.5.4}$$

and

$$E(\operatorname{Var}(Y|X)) = \sigma_2^2(1 - \rho^2). \tag{2.5.5}$$


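Proof: The proof is given in the continuous case. The discrete case follows similarly
by changing integrals to sums. Let $E(Y|x) = a + bx$. From

$$E(Y|x) = \frac{\int_{-\infty}^{\infty} y f(x, y)\,dy}{f_1(x)} = a + bx,$$

we have

$$\int_{-\infty}^{\infty} y f(x, y)\,dy = (a + bx) f_1(x). \tag{2.5.6}$$

If both members of Equation (2.5.6) are integrated on $x$, it is seen that

$$E(Y) = a + bE(X)$$

or

$$\mu_2 = a + b\mu_1, \tag{2.5.7}$$

where $\mu_1 = E(X)$ and $\mu_2 = E(Y)$. If both members of Equation (2.5.6) are first
multiplied by $x$ and then integrated on $x$, we have

$$E(XY) = aE(X) + bE(X^2),$$

or

$$\rho\sigma_1\sigma_2 + \mu_1\mu_2 = a\mu_1 + b(\sigma_1^2 + \mu_1^2), \tag{2.5.8}$$

where $\rho\sigma_1\sigma_2$ is the covariance of $X$ and $Y$. The simultaneous solution of equations
(2.5.7) and (2.5.8) yields

$$a = \mu_2 - \rho\frac{\sigma_2}{\sigma_1}\mu_1 \quad \text{and} \quad b = \rho\frac{\sigma_2}{\sigma_1}.$$

These values give the first result (2.5.4).
Next, the conditional variance of $Y$ is given by

$$\begin{aligned}
\operatorname{Var}(Y|x) &= \int_{-\infty}^{\infty} \left[ y - \mu_2 - \rho\frac{\sigma_2}{\sigma_1}(x - \mu_1) \right]^2 f_{2|1}(y|x)\,dy \\
&= \frac{\int_{-\infty}^{\infty} \left[ (y - \mu_2) - \rho\frac{\sigma_2}{\sigma_1}(x - \mu_1) \right]^2 f(x, y)\,dy}{f_1(x)}. \tag{2.5.9}
\end{aligned}$$

This variance is nonnegative and is at most a function of $x$ alone. If it is multiplied
by $f_1(x)$ and integrated on $x$, the result obtained is nonnegative. This result is

$$\begin{aligned}
&\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \left[ (y - \mu_2) - \rho\frac{\sigma_2}{\sigma_1}(x - \mu_1) \right]^2 f(x, y)\,dy\,dx \\
&\quad = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \left[ (y - \mu_2)^2 - 2\rho\frac{\sigma_2}{\sigma_1}(y - \mu_2)(x - \mu_1) + \rho^2\frac{\sigma_2^2}{\sigma_1^2}(x - \mu_1)^2 \right] f(x, y)\,dy\,dx \\
&\quad = E[(Y - \mu_2)^2] - 2\rho\frac{\sigma_2}{\sigma_1}E[(X - \mu_1)(Y - \mu_2)] + \rho^2\frac{\sigma_2^2}{\sigma_1^2}E[(X - \mu_1)^2] \\
&\quad = \sigma_2^2 - 2\rho\frac{\sigma_2}{\sigma_1}\rho\sigma_1\sigma_2 + \rho^2\frac{\sigma_2^2}{\sigma_1^2}\sigma_1^2 \\
&\quad = \sigma_2^2 - 2\rho^2\sigma_2^2 + \rho^2\sigma_2^2 = \sigma_2^2(1 - \rho^2),
\end{aligned}$$

which is the desired result. ■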


Note that if the variance, Equation (2.5.9), is denoted by $k(x)$, then $E[k(X)] =
\sigma_2^2(1 - \rho^2) \ge 0$. Accordingly, $\rho^2 \le 1$, or $-1 \le \rho \le 1$. This verifies Theorem 2.5.1
for the special case of linear conditional means.

As a corollary to Theorem 2.5.3, suppose that the variance, Equation (2.5.9), is
positive but not a function of $x$; that is, the variance is a constant $k > 0$. Now if $k$
is multiplied by $f_1(x)$ and integrated on $x$, the result is $k$, so that $k = \sigma_2^2(1 - \rho^2)$.
Thus, in this case, the variance of each conditional distribution of $Y$, given $X = x$, is
$\sigma_2^2(1 - \rho^2)$. If $\rho = 0$, the variance of each conditional distribution of $Y$, given $X = x$,
is $\sigma_2^2$, the variance of the marginal distribution of $Y$. On the other hand, if $\rho^2$ is near
1, the variance of each conditional distribution of $Y$, given $X = x$, is relatively small,
and there is a high concentration of the probability for this conditional distribution
near the mean $E(Y|x) = \mu_2 + \rho(\sigma_2/\sigma_1)(x - \mu_1)$. Similar comments can be made
about $E(X|y)$ if it is linear. In particular, $E(X|y) = \mu_1 + \rho(\sigma_1/\sigma_2)(y - \mu_2)$ and
$E[\operatorname{Var}(X|Y)] = \sigma_1^2(1 - \rho^2)$.


Example 2.5.4. Let the random variables $X$ and $Y$ have the linear conditional
means $E(Y|x) = 4x + 3$ and $E(X|y) = \frac{1}{16}y - 3$. In accordance with the general
formulas for the linear conditional means, we see that $E(Y|x) = \mu_2$ if $x = \mu_1$ and
$E(X|y) = \mu_1$ if $y = \mu_2$. Accordingly, in this special case, we have $\mu_2 = 4\mu_1 + 3$
and $\mu_1 = \frac{1}{16}\mu_2 - 3$ so that $\mu_1 = -\frac{15}{4}$ and $\mu_2 = -12$. The general formulas for the
linear conditional means also show that the product of the coefficients of $x$ and $y$,
respectively, is equal to $\rho^2$ and that the quotient of these coefficients is equal to
$\sigma_2^2/\sigma_1^2$. Here $\rho^2 = 4\left(\frac{1}{16}\right) = \frac{1}{4}$ with $\rho = \frac{1}{2}$ (not $-\frac{1}{2}$), and $\sigma_2^2/\sigma_1^2 = 64$. Thus, from the
two linear conditional means, we are able to find the values of $\mu_1$, $\mu_2$, $\rho$, and $\sigma_2/\sigma_1$,
but not the values of $\sigma_1$ and $\sigma_2$. ■
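The arithmetic in this example amounts to a two-equation linear system plus two ratios. A minimal R sketch, assuming only the two conditional means stated in the example (the object names are illustrative):

```r
# Unknowns (mu1, mu2) solve:  -4*mu1 + mu2 = 3  and  mu1 - mu2/16 = -3.
A <- matrix(c(-4,  1,
               1, -1/16), nrow = 2, byrow = TRUE)
mu <- solve(A, c(3, -3))                 # mu[1] = mu1, mu[2] = mu2

slope_yx <- 4                            # coefficient of x in E(Y | x)
slope_xy <- 1 / 16                       # coefficient of y in E(X | y)
rho <- sqrt(slope_yx * slope_xy)         # positive root, since both slopes are positive
sd_ratio <- sqrt(slope_yx / slope_xy)    # sigma2 / sigma1

list(mu = mu, rho = rho, sd_ratio = sd_ratio)
```

The sign of $\rho$ is taken from the common sign of the two slopes, which is why the positive root is used.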


[Figure 2.5.1: Illustration for Example 2.5.5. The region lies between the lines
y = -a + bx and y = a + bx for -h < x < h, centered at (0, 0) about the line E(Y|x) = bx.]

Example 2.5.5. To illustrate how the correlation coefficient measures the intensity
of the concentration of the probability for $X$ and $Y$ about a line, let these random
variables have a distribution that is uniform over the area depicted in Figure 2.5.1.
That is, the joint pdf of $X$ and $Y$ is

$$f(x, y) = \begin{cases} \dfrac{1}{4ah} & -a + bx < y < a + bx,\ -h < x < h \\ 0 & \text{elsewhere.} \end{cases}$$

We assume here that $b > 0$, but the argument can be modified for $b \le 0$. It is easy
to show that the pdf of $X$ is uniform, namely

$$f_1(x) = \begin{cases} \displaystyle\int_{-a+bx}^{a+bx} \frac{1}{4ah}\,dy = \frac{1}{2h} & -h < x < h \\ 0 & \text{elsewhere.} \end{cases}$$

The conditional mean and variance are

$$E(Y|x) = bx \quad \text{and} \quad \mathrm{var}(Y|x) = \frac{a^2}{3}.$$

From the general expressions for those characteristics we know that

$$b = \rho\frac{\sigma_2}{\sigma_1} \quad \text{and} \quad \frac{a^2}{3} = \sigma_2^2(1 - \rho^2).$$

Additionally, we know that $\sigma_1^2 = h^2/3$. If we solve these three equations, we obtain
an expression for the correlation coefficient, namely

$$\rho = \frac{bh}{\sqrt{a^2 + b^2h^2}}.$$
Referring to Figure 2.5.1, we note 


1. As $a$ gets small (large), the straight-line effect is more (less) intense and $\rho$ is
closer to 1 (0).


2. As $h$ gets large (small), the straight-line effect is more (less) intense and $\rho$ is
closer to 1 (0).


3. As $b$ gets large (small), the straight-line effect is more (less) intense and $\rho$ is
closer to 1 (0). ■
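The three observations can be explored by simulation. The R sketch below draws from the uniform distribution of Example 2.5.5 by taking $X$ uniform on $(-h, h)$ and adding uniform noise on $(-a, a)$ to $bX$; the seed, sample size, and function names are arbitrary choices made for illustration.

```r
set.seed(1)                                    # arbitrary seed, for reproducibility

# Correlation coefficient derived in Example 2.5.5.
rho_theory <- function(a, b, h) b * h / sqrt(a^2 + b^2 * h^2)

# Sample correlation from n simulated points in the region of Figure 2.5.1.
sim_rho <- function(a, b, h, n = 1e5) {
  x <- runif(n, -h, h)                         # X is uniform on (-h, h)
  y <- b * x + runif(n, -a, a)                 # given x, Y is uniform on (-a + bx, a + bx)
  cor(x, y)
}

a <- 1; b <- 2; h <- 1
c(simulated = sim_rho(a, b, h), theory = rho_theory(a, b, h))
```

Rerunning with a smaller `a`, or a larger `h` or `b`, should move both numbers toward 1.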


Recall that in Section 2.1 we introduced the mgf for the random vector (X,Y). 
As for random variables, the joint mgf also gives explicit formulas for certain mo- 
ments. In the case of random variables of the continuous type, 


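$$\frac{\partial^{k+m} M(t_1, t_2)}{\partial t_1^k\,\partial t_2^m} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x^k y^m e^{t_1 x + t_2 y} f(x, y)\,dx\,dy,$$

so that

$$\left.\frac{\partial^{k+m} M(t_1, t_2)}{\partial t_1^k\,\partial t_2^m}\right|_{t_1 = t_2 = 0} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x^k y^m f(x, y)\,dx\,dy = E(X^k Y^m).$$

For instance, in a simplified notation that appears to be clear,

$$\begin{aligned}
\mu_1 &= E(X) = \frac{\partial M(0,0)}{\partial t_1} \\
\mu_2 &= E(Y) = \frac{\partial M(0,0)}{\partial t_2} \\
\sigma_1^2 &= E(X^2) - \mu_1^2 = \frac{\partial^2 M(0,0)}{\partial t_1^2} - \mu_1^2 \\
\sigma_2^2 &= E(Y^2) - \mu_2^2 = \frac{\partial^2 M(0,0)}{\partial t_2^2} - \mu_2^2 \\
E[(X - \mu_1)(Y - \mu_2)] &= \frac{\partial^2 M(0,0)}{\partial t_1\,\partial t_2} - \mu_1\mu_2,
\end{aligned} \tag{2.5.10}$$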


and from these we can compute the correlation coefficient p. 

It is fairly obvious that the results of equations (2.5.10) hold if X and Y are 
random variables of the discrete type. Thus the correlation coefficients may be com- 
puted by using the mgf of the joint distribution if that function is readily available. 
An illustrative example follows. 


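Example 2.5.6 (Example 2.1.10, Continued). In Example 2.1.10, we considered
the joint density

$$f(x, y) = \begin{cases} e^{-y} & 0 < x < y < \infty \\ 0 & \text{elsewhere,} \end{cases}$$

and showed that the mgf was

$$M(t_1, t_2) = \frac{1}{(1 - t_1 - t_2)(1 - t_2)},$$

for $t_1 + t_2 < 1$ and $t_2 < 1$. For this distribution, equations (2.5.10) become

$$\mu_1 = 1,\quad \mu_2 = 2,\quad \sigma_1^2 = 1,\quad \sigma_2^2 = 2,\quad E[(X - \mu_1)(Y - \mu_2)] = 1. \tag{2.5.11}$$

Verification of (2.5.11) is left as an exercise; see Exercise 2.5.5. If, momentarily, we
accept these results, the correlation coefficient of $X$ and $Y$ is $\rho = 1/\sqrt{2}$. ■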
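Equations (2.5.10) can be applied mechanically with symbolic differentiation. The R sketch below uses the base function `D()` on the mgf of Example 2.5.6; the helper `d()` and the other object names are introduced here for illustration and are not part of the text.

```r
# Joint mgf of Example 2.5.6 as an unevaluated expression in t1 and t2.
M <- quote(1 / ((1 - t1 - t2) * (1 - t2)))
at0 <- list(t1 = 0, t2 = 0)

# Hypothetical helper: repeated symbolic partial derivatives via D().
d <- function(expr, ...) {
  for (v in c(...)) expr <- D(expr, v)
  expr
}

mu1   <- eval(d(M, "t1"), at0)
mu2   <- eval(d(M, "t2"), at0)
var1  <- eval(d(M, "t1", "t1"), at0) - mu1^2
var2  <- eval(d(M, "t2", "t2"), at0) - mu2^2
cov12 <- eval(d(M, "t1", "t2"), at0) - mu1 * mu2
rho   <- cov12 / sqrt(var1 * var2)

c(mu1 = mu1, mu2 = mu2, var1 = var1, var2 = var2, cov12 = cov12, rho = rho)
```

This reproduces the pattern of (2.5.10) only; it is not a substitute for the hand verification asked for in Exercise 2.5.5.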


EXERCISES 
2.5.1. Let the random variables $X$ and $Y$ have the joint pmf

(a) $p(x, y) = \frac{1}{3}$, $(x, y) = (0, 0), (1, 1), (2, 2)$, zero elsewhere.

(b) $p(x, y) = \frac{1}{3}$, $(x, y) = (0, 2), (1, 1), (2, 0)$, zero elsewhere.

(c) $p(x, y) = \frac{1}{3}$, $(x, y) = (0, 0), (1, 1), (2, 0)$, zero elsewhere.

In each case compute the correlation coefficient of $X$ and $Y$.


2.5.2. Let $X$ and $Y$ have the joint pmf described as follows:

(x, y)     (1, 1)  (1, 2)  (1, 3)  (2, 1)  (2, 2)  (2, 3)
p(x, y)    [six probabilities; values illegible in the source]

and $p(x, y)$ is equal to zero elsewhere.

(a) Find the means $\mu_1$ and $\mu_2$, the variances $\sigma_1^2$ and $\sigma_2^2$, and the correlation
coefficient $\rho$.

(b) Compute $E(Y|X = 1)$, $E(Y|X = 2)$, and the line $\mu_2 + \rho(\sigma_2/\sigma_1)(x - \mu_1)$. Do
the points $[k, E(Y|X = k)]$, $k = 1, 2$, lie on this line?


2.5.3. Let $f(x, y) = 2$, $0 < x < y$, $0 < y < 1$, zero elsewhere, be the joint pdf of
$X$ and $Y$. Show that the conditional means are, respectively, $(1 + x)/2$, $0 < x < 1$,
and $y/2$, $0 < y < 1$. Show that the correlation coefficient of $X$ and $Y$ is $\rho = \frac{1}{2}$.


2.5.4. Show that the variance of the conditional distribution of $Y$, given $X = x$, in
Exercise 2.5.3, is $(1 - x)^2/12$, $0 < x < 1$, and that the variance of the conditional
distribution of $X$, given $Y = y$, is $y^2/12$, $0 < y < 1$.


2.5.5. Verify the results of equations (2.5.11) of this section.


2.5.6. Let $X$ and $Y$ have the joint pdf $f(x, y) = 1$, $-x < y < x$, $0 < x < 1$,
zero elsewhere. Show that, on the set of positive probability density, the graph of
$E(Y|x)$ is a straight line, whereas that of $E(X|y)$ is not a straight line.


2.5.7. In the proof of Theorem 2.5.1, consider the case when the discriminant of
the polynomial $h(v)$ is 0. Show that this is equivalent to $\rho = \pm 1$. Consider the case
when $\rho = 1$. Find the unique root of $h(v)$ and then use the fact that $h(v)$ is 0 at
this root to show that $Y$ is a linear function of $X$ with probability 1.


2.5.8. Let $\psi(t_1, t_2) = \log M(t_1, t_2)$, where $M(t_1, t_2)$ is the mgf of $X$ and $Y$. Show
that

$$\frac{\partial \psi(0,0)}{\partial t_i}, \quad \frac{\partial^2 \psi(0,0)}{\partial t_i^2}, \quad i = 1, 2,$$

and

$$\frac{\partial^2 \psi(0,0)}{\partial t_1\,\partial t_2}$$

yield the means, the variances, and the covariance of the two random variables.
Use this result to find the means, the variances, and the covariance of $X$ and $Y$ of
Example 2.5.6.


2.5.9. Let $X$ and $Y$ have the joint pmf $p(x, y) = \frac{1}{7}$, $(0, 0), (1, 0), (0, 1), (1, 1), (2, 1),
(1, 2), (2, 2)$, zero elsewhere. Find the correlation coefficient $\rho$.


2.5.10. Let $X_1$ and $X_2$ have the joint pmf described by the following table:

(x1, x2)     (0, 0)  (0, 1)  (0, 2)  (1, 1)  (1, 2)  (2, 2)
p(x1, x2)    [six probabilities; values illegible in the source]

Find $p_1(x_1)$, $p_2(x_2)$, $\mu_1$, $\mu_2$, $\sigma_1^2$, $\sigma_2^2$, and $\rho$.


2.5.11. Let $\sigma_1^2 = \sigma_2^2 = \sigma^2$ be the common variance of $X_1$ and $X_2$ and let $\rho$ be the
correlation coefficient of $X_1$ and $X_2$. Show for $k > 0$ that

$$P\left[\,|(X_1 - \mu_1) + (X_2 - \mu_2)| \ge k\sigma\,\right] \le \frac{2(1 + \rho)}{k^2}.$$


2.6 Extension to Several Random Variables 


The notions about two random variables can be extended immediately to n random 
variables. We make the following definition of the space of n random variables. 


Definition 2.6.1. Consider a random experiment with the sample space $\mathcal{C}$. Let
the random variable $X_i$ assign to each element $c \in \mathcal{C}$ one and only one real number
$X_i(c) = x_i$, $i = 1, 2, \ldots, n$. We say that $(X_1, \ldots, X_n)$ is an n-dimensional
random vector. The space of this random vector is the set of ordered n-tuples
$\mathcal{D} = \{(x_1, x_2, \ldots, x_n) : x_1 = X_1(c), \ldots, x_n = X_n(c),\ c \in \mathcal{C}\}$. Furthermore, let $A$ be
a subset of the space $\mathcal{D}$. Then $P[(X_1, \ldots, X_n) \in A] = P(C)$, where
$C = \{c : c \in \mathcal{C} \text{ and } (X_1(c), X_2(c), \ldots, X_n(c)) \in A\}$.


In this section, we often use vector notation. We denote $(X_1, \ldots, X_n)'$ by the
n-dimensional column vector $\mathbf{X}$ and the observed values $(x_1, \ldots, x_n)'$ of the random
variables by $\mathbf{x}$. The joint cdf is defined to be

$$F_{\mathbf{X}}(\mathbf{x}) = P[X_1 \le x_1, \ldots, X_n \le x_n]. \tag{2.6.1}$$


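We say that the n random variables $X_1, X_2, \ldots, X_n$ are of the discrete type or
of the continuous type and have a distribution of that type according to whether
the joint cdf can be expressed as

$$F_{\mathbf{X}}(\mathbf{x}) = \mathop{\sum \cdots \sum}_{w_1 \le x_1, \ldots, w_n \le x_n} p(w_1, \ldots, w_n),$$

or as

$$F_{\mathbf{X}}(\mathbf{x}) = \int_{-\infty}^{x_1}\int_{-\infty}^{x_2}\cdots\int_{-\infty}^{x_n} f(w_1, \ldots, w_n)\,dw_n \cdots dw_1.$$

For the continuous case,

$$\frac{\partial^n}{\partial x_1 \cdots \partial x_n} F_{\mathbf{X}}(\mathbf{x}) = f(\mathbf{x}), \tag{2.6.2}$$

except possibly on points that have probability zero.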

In accordance with the convention of extending the definition of a joint pdf, 
it is seen that a continuous function f essentially satisfies the conditions of being 
a pdf if (a) f is defined and is nonnegative for all real values of its argument(s)
and (b) its integral over all real values of its argument(s) is 1. Likewise, a point
function p essentially satisfies the conditions of being a joint pmf if (a) p is defined 
and is nonnegative for all real values of its argument(s) and (b) its sum over all real 
values of its argument(s) is 1. As in previous sections, it is sometimes convenient 
to speak of the support set of a random vector. For the discrete case, this would be 
all points in D that have positive mass, while for the continuous case these would 
be all points in D that can be embedded in an open set of positive probability. We 
use S to denote support sets. 


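Example 2.6.1. Let

$$f(x, y, z) = \begin{cases} e^{-(x + y + z)} & 0 < x, y, z < \infty \\ 0 & \text{elsewhere} \end{cases}$$

be the pdf of the random variables $X$, $Y$, and $Z$. Then the distribution function of
$X$, $Y$, and $Z$ is given by

$$\begin{aligned}
F(x, y, z) &= P(X \le x, Y \le y, Z \le z) \\
&= \int_0^z \int_0^y \int_0^x e^{-u - v - w}\,du\,dv\,dw \\
&= (1 - e^{-x})(1 - e^{-y})(1 - e^{-z}), \quad 0 \le x, y, z < \infty,
\end{aligned}$$

and is equal to zero elsewhere. The relationship (2.6.2) can easily be verified. ■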
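One way to see relationship (2.6.2) at work is to differentiate the cdf numerically and compare with the pdf. The R sketch below uses a central-difference approximation to the mixed third partial derivative; the step size, evaluation point, and function names are arbitrary choices made for illustration.

```r
F_xyz <- function(x, y, z) (1 - exp(-x)) * (1 - exp(-y)) * (1 - exp(-z))  # cdf from the example
f_xyz <- function(x, y, z) exp(-(x + y + z))                              # pdf from the example

# Central-difference approximation to d^3 F / (dx dy dz) at (x, y, z).
mixed_partial <- function(x, y, z, h = 1e-3) {
  s <- c(-1, 1)
  total <- 0
  for (i in s) for (j in s) for (k in s)
    total <- total + i * j * k * F_xyz(x + i * h, y + j * h, z + k * h)
  total / (2 * h)^3
}

c(numerical = mixed_partial(0.5, 1, 1.5), pdf = f_xyz(0.5, 1, 1.5))
```

The two values agree up to the discretization error of the finite difference.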


Let $(X_1, X_2, \ldots, X_n)$ be a random vector and let $Y = u(X_1, X_2, \ldots, X_n)$ for
some function $u$. As in the bivariate case, the expected value of the random variable
exists if the n-fold integral

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} |u(x_1, x_2, \ldots, x_n)|\, f(x_1, x_2, \ldots, x_n)\,dx_1 dx_2 \cdots dx_n$$

exists when the random variables are of the continuous type, or if the n-fold sum

$$\sum_{x_n} \cdots \sum_{x_1} |u(x_1, x_2, \ldots, x_n)|\, p(x_1, x_2, \ldots, x_n)$$

exists when the random variables are of the discrete type. If the expected value of
$Y$ exists, then its expectation is given by

$$E(Y) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} u(x_1, x_2, \ldots, x_n) f(x_1, x_2, \ldots, x_n)\,dx_1 dx_2 \cdots dx_n \tag{2.6.3}$$

for the continuous case, and by

$$E(Y) = \sum_{x_n} \cdots \sum_{x_1} u(x_1, x_2, \ldots, x_n)\, p(x_1, x_2, \ldots, x_n) \tag{2.6.4}$$

for the discrete case. The properties of expectation discussed in Section 2.1 hold
for the n-dimensional case also. In particular, $E$ is a linear operator. That is, if
$Y_j = u_j(X_1, \ldots, X_n)$ for $j = 1, \ldots, m$ and each $E(Y_j)$ exists, then

$$E\left[\sum_{j=1}^{m} k_j Y_j\right] = \sum_{j=1}^{m} k_j E[Y_j], \tag{2.6.5}$$

where $k_1, \ldots, k_m$ are constants.


We next discuss the notions of marginal and conditional probability density 
functions from the point of view of n random variables. All of the preceding defini- 
tions can be directly generalized to the case of n variables in the following manner. 
Let the random variables $X_1, X_2, \ldots, X_n$ be of the continuous type with the joint
pdf $f(x_1, x_2, \ldots, x_n)$. By an argument similar to the two-variable case, we have for
every $b$,

$$F_{X_1}(b) = P(X_1 \le b) = \int_{-\infty}^{b} f_1(x_1)\,dx_1,$$

where $f_1(x_1)$ is defined by the $(n - 1)$-fold integral

$$f_1(x_1) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, x_2, \ldots, x_n)\,dx_2 \cdots dx_n.$$

Therefore, $f_1(x_1)$ is the pdf of the random variable $X_1$ and $f_1(x_1)$ is called the
marginal pdf of $X_1$. The marginal probability density functions $f_2(x_2), \ldots, f_n(x_n)$
of $X_2, \ldots, X_n$, respectively, are similar $(n - 1)$-fold integrals.

Up to this point, each marginal pdf has been a pdf of one random variable. 
It is convenient to extend this terminology to joint probability density functions, 
which we do now. Let $f(x_1, x_2, \ldots, x_n)$ be the joint pdf of the n random variables
$X_1, X_2, \ldots, X_n$, just as before. Now, however, take any group of $k < n$ of these
random variables and find the joint pdf of them. This joint pdf is called the marginal
pdf of this particular group of $k$ variables. To fix the ideas, take $n = 6$, $k = 3$, and
let us select the group $X_2, X_4, X_5$. Then the marginal pdf of $X_2, X_4, X_5$ is the joint
pdf of this particular group of three variables, namely,

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x_1, x_2, x_3, x_4, x_5, x_6)\,dx_1 dx_3 dx_6,$$

if the random variables are of the continuous type.
Next we extend the definition of a conditional pdf. Suppose $f_1(x_1) > 0$. Then
we define the symbol $f_{2,\ldots,n|1}(x_2, \ldots, x_n|x_1)$ by the relation

$$f_{2,\ldots,n|1}(x_2, \ldots, x_n|x_1) = \frac{f(x_1, x_2, \ldots, x_n)}{f_1(x_1)},$$

and $f_{2,\ldots,n|1}(x_2, \ldots, x_n|x_1)$ is called the joint conditional pdf of $X_2, \ldots, X_n$,
given $X_1 = x_1$. The joint conditional pdf of any $n - 1$ random variables, say
$X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n$, given $X_i = x_i$, is defined as the joint pdf of $X_1, \ldots, X_n$
divided by the marginal pdf $f_i(x_i)$, provided that $f_i(x_i) > 0$. More generally, the
joint conditional pdf of $n - k$ of the random variables, for given values of the remaining
$k$ variables, is defined as the joint pdf of the n variables divided by the marginal
pdf of the particular group of $k$ variables, provided that the latter pdf is positive.
We remark that there are many other conditional probability density functions; for
instance, see Exercise 2.3.12.

Because a conditional pdf is the pdf of a certain number of random variables, 
the expectation of a function of these random variables has been defined. To em- 
phasize the fact that a conditional pdf is under consideration, such expectations 
are called conditional expectations. For instance, the conditional expectation of
$u(X_2, \ldots, X_n)$, given $X_1 = x_1$, is, for random variables of the continuous type,
given by

$$E[u(X_2, \ldots, X_n)|x_1] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} u(x_2, \ldots, x_n)\, f_{2,\ldots,n|1}(x_2, \ldots, x_n|x_1)\,dx_2 \cdots dx_n,$$

provided $f_1(x_1) > 0$ and the integral converges (absolutely). A useful random
variable is given by $h(X_1) = E[u(X_2, \ldots, X_n)|X_1]$.

The above discussion of marginal and conditional distributions generalizes to 
random variables of the discrete type by using pmfs and summations instead of 
integrals. 

Let the random variables $X_1, X_2, \ldots, X_n$ have the joint pdf $f(x_1, x_2, \ldots, x_n)$ and
the marginal probability density functions $f_1(x_1), f_2(x_2), \ldots, f_n(x_n)$, respectively.
The definition of the independence of $X_1$ and $X_2$ is generalized to the mutual
independence of $X_1, X_2, \ldots, X_n$ as follows: The random variables $X_1, X_2, \ldots, X_n$
are said to be mutually independent if and only if

$$f(x_1, x_2, \ldots, x_n) = f_1(x_1) f_2(x_2) \cdots f_n(x_n),$$

for the continuous case. In the discrete case, $X_1, X_2, \ldots, X_n$ are said to be mutually
independent if and only if

$$p(x_1, x_2, \ldots, x_n) = p_1(x_1) p_2(x_2) \cdots p_n(x_n).$$

Suppose $X_1, X_2, \ldots, X_n$ are mutually independent. Then

$$\begin{aligned}
P(a_1 < X_1 < b_1, a_2 < X_2 < b_2, \ldots, a_n < X_n < b_n)
&= P(a_1 < X_1 < b_1) P(a_2 < X_2 < b_2) \cdots P(a_n < X_n < b_n) \\
&= \prod_{i=1}^{n} P(a_i < X_i < b_i),
\end{aligned}$$

where the symbol $\prod_{i=1}^{n} \varphi(i)$ is defined to be

$$\prod_{i=1}^{n} \varphi(i) = \varphi(1)\varphi(2) \cdots \varphi(n).$$

The theorem that

$$E[u(X_1)v(X_2)] = E[u(X_1)]E[v(X_2)]$$

for independent random variables $X_1$ and $X_2$ becomes, for mutually independent
random variables $X_1, X_2, \ldots, X_n$,

$$E[u_1(X_1)u_2(X_2) \cdots u_n(X_n)] = E[u_1(X_1)]E[u_2(X_2)] \cdots E[u_n(X_n)],$$

or

$$E\left[\prod_{i=1}^{n} u_i(X_i)\right] = \prod_{i=1}^{n} E[u_i(X_i)].$$


The moment-generating function (mgf) of the joint distribution of n random
variables $X_1, X_2, \ldots, X_n$ is defined as follows. Suppose that

$$E[\exp(t_1X_1 + t_2X_2 + \cdots + t_nX_n)]$$

exists for $-h_i < t_i < h_i$, $i = 1, 2, \ldots, n$, where each $h_i$ is positive. This expectation
is denoted by $M(t_1, t_2, \ldots, t_n)$ and it is called the mgf of the joint distribution of
$X_1, \ldots, X_n$ (or simply the mgf of $X_1, \ldots, X_n$). As in the cases of one and two
variables, this mgf is unique and uniquely determines the joint distribution of the
n variables (and hence all marginal distributions). For example, the mgf of the
marginal distribution of $X_i$ is $M(0, \ldots, 0, t_i, 0, \ldots, 0)$, $i = 1, 2, \ldots, n$; that of the
marginal distribution of $X_i$ and $X_j$ is $M(0, \ldots, 0, t_i, 0, \ldots, 0, t_j, 0, \ldots, 0)$; and so on.
Theorem 2.4.5 of this chapter can be generalized, and the factorization

$$M(t_1, t_2, \ldots, t_n) = \prod_{i=1}^{n} M(0, \ldots, 0, t_i, 0, \ldots, 0) \tag{2.6.6}$$

is a necessary and sufficient condition for the mutual independence of $X_1, X_2, \ldots, X_n$.

Note that we can write the joint mgf in vector notation as

$$M(\mathbf{t}) = E[\exp(\mathbf{t}'\mathbf{X})], \quad \text{for } \mathbf{t} \in B \subset R^n,$$

where $B = \{\mathbf{t} : -h_i < t_i < h_i,\ i = 1, \ldots, n\}$.

The following is a theorem that proves useful in the sequel. It gives the mgf of
a linear combination of independent random variables.

Theorem 2.6.1. Suppose $X_1, X_2, \ldots, X_n$ are n mutually independent random variables.
Suppose, for all $i = 1, 2, \ldots, n$, $X_i$ has mgf $M_i(t)$, for $-h_i < t < h_i$, where
$h_i > 0$. Let $T = \sum_{i=1}^{n} k_i X_i$, where $k_1, k_2, \ldots, k_n$ are constants. Then $T$ has the
mgf given by

$$M_T(t) = \prod_{i=1}^{n} M_i(k_i t), \quad -\min_i\{h_i\} < t < \min_i\{h_i\}. \tag{2.6.7}$$

Proof. Assume $t$ is in the interval $(-\min_i\{h_i\}, \min_i\{h_i\})$. Then, by independence,

$$M_T(t) = E\left[e^{\sum_{i=1}^{n} t k_i X_i}\right] = E\left[\prod_{i=1}^{n} e^{(t k_i) X_i}\right] = \prod_{i=1}^{n} E\left[e^{t k_i X_i}\right] = \prod_{i=1}^{n} M_i(k_i t),$$

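which completes the proof. ■

Example 2.6.2. Let $X_1$, $X_2$, and $X_3$ be three mutually independent random variables
and let each have the pdf

$$f(x) = \begin{cases} 2x & 0 < x < 1 \\ 0 & \text{elsewhere.} \end{cases} \tag{2.6.8}$$

The joint pdf of $X_1, X_2, X_3$ is $f(x_1)f(x_2)f(x_3) = 8x_1x_2x_3$, $0 < x_i < 1$, $i = 1, 2, 3$,
zero elsewhere. Then, for illustration, the expected value of $5X_1X_2^3 + 3X_2X_3^4$ is

$$\int_0^1\int_0^1\int_0^1 (5x_1x_2^3 + 3x_2x_3^4)\,8x_1x_2x_3\,dx_1dx_2dx_3 = 2.$$

Let $Y$ be the maximum of $X_1$, $X_2$, and $X_3$. Then, for instance, we have

$$\begin{aligned}
P\left(Y \le \tfrac{1}{2}\right) &= P\left(X_1 \le \tfrac{1}{2}, X_2 \le \tfrac{1}{2}, X_3 \le \tfrac{1}{2}\right) \\
&= \int_0^{1/2}\int_0^{1/2}\int_0^{1/2} 8x_1x_2x_3\,dx_1dx_2dx_3 \\
&= \left(\tfrac{1}{2}\right)^6 = \tfrac{1}{64}.
\end{aligned}$$

In a similar manner, we find that the cdf of $Y$ is

$$G(y) = P(Y \le y) = \begin{cases} 0 & y < 0 \\ y^6 & 0 \le y < 1 \\ 1 & 1 \le y. \end{cases}$$

Accordingly, the pdf of $Y$ is

$$g(y) = \begin{cases} 6y^5 & 0 < y < 1 \\ 0 & \text{elsewhere.} \end{cases}$$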
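The two calculations in Example 2.6.2 can be reproduced with one-dimensional integrals and a small simulation. The R sketch below relies on independence to factor the expectation, and it draws from the pdf (2.6.8) by the inverse-cdf method, $X = \sqrt{U}$ with $U$ uniform on $(0, 1)$; the seed and object names are arbitrary.

```r
f <- function(x) 2 * x                                   # common pdf (2.6.8) on (0, 1)
moment <- function(k) integrate(function(x) x^k * f(x), 0, 1)$value   # E(X^k)

# E(5 X1 X2^3 + 3 X2 X3^4), using linearity and independence.
ev <- 5 * moment(1) * moment(3) + 3 * moment(1) * moment(4)

# cdf of Y = max(X1, X2, X3): G(y) = [F(y)]^3 with F(y) = y^2 on (0, 1).
G <- function(y) (y^2)^3

# Simulation check of P(Y <= 1/2).
set.seed(2)                                              # arbitrary seed
x <- matrix(sqrt(runif(3e5)), ncol = 3)                  # inverse-cdf draws, one row per triple
y <- pmax(x[, 1], x[, 2], x[, 3])

c(expected_value = ev, G_half = G(0.5), simulated = mean(y <= 0.5))
```

The first two entries are exact up to quadrature error; the third is a Monte Carlo estimate and varies with the seed.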


Remark 2.6.1. If $X_1$, $X_2$, and $X_3$ are mutually independent, they are pairwise
independent (that is, $X_i$ and $X_j$, $i \ne j$, where $i, j = 1, 2, 3$, are independent).
However, the following example, attributed to S. Bernstein, shows that pairwise
independence does not necessarily imply mutual independence. Let $X_1$, $X_2$, and $X_3$
have the joint pmf

$$p(x_1, x_2, x_3) = \begin{cases} \frac{1}{4} & (x_1, x_2, x_3) \in \{(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)\} \\ 0 & \text{elsewhere.} \end{cases}$$

The joint pmf of $X_i$ and $X_j$, $i \ne j$, is

$$p_{ij}(x_i, x_j) = \begin{cases} \frac{1}{4} & (x_i, x_j) \in \{(0, 0), (1, 0), (0, 1), (1, 1)\} \\ 0 & \text{elsewhere,} \end{cases}$$

whereas the marginal pmf of $X_i$ is

$$p_i(x_i) = \begin{cases} \frac{1}{2} & x_i = 0, 1 \\ 0 & \text{elsewhere.} \end{cases}$$

Obviously, if $i \ne j$, we have

$$p_{ij}(x_i, x_j) = p_i(x_i)p_j(x_j),$$

and thus $X_i$ and $X_j$ are independent. However,

$$p(x_1, x_2, x_3) \ne p_1(x_1)p_2(x_2)p_3(x_3).$$

Thus $X_1$, $X_2$, and $X_3$ are not mutually independent.
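Bernstein's example is small enough to check by enumeration. The R sketch below tabulates the joint pmf over all eight points of $\{0, 1\}^3$, sums it to obtain the marginal and pairwise pmfs, and tests both factorizations; all object names are illustrative.

```r
support <- rbind(c(1, 0, 0), c(0, 1, 0), c(0, 0, 1), c(1, 1, 1))
p <- function(x) if (any(apply(support, 1, function(s) all(s == x)))) 1/4 else 0

grid <- as.matrix(expand.grid(x1 = 0:1, x2 = 0:1, x3 = 0:1))   # all 8 candidate points
joint <- apply(grid, 1, p)

# Marginal pmf of X_i and joint pmf of (X_i, X_j), by summing the joint pmf.
marg <- function(i, a) sum(joint[grid[, i] == a])
pair <- function(i, j, a, b) sum(joint[grid[, i] == a & grid[, j] == b])

# Pairwise factorization holds at every point for every pair (i, j).
pairwise_ok <- all(sapply(list(c(1, 2), c(1, 3), c(2, 3)), function(ij)
  all(sapply(0:1, function(a) sapply(0:1, function(b)
    isTRUE(all.equal(pair(ij[1], ij[2], a, b), marg(ij[1], a) * marg(ij[2], b))))))))

# Three-way factorization fails: p is 1/4 or 0, the product of marginals is 1/8.
mutual_ok <- all(apply(grid, 1, function(x)
  isTRUE(all.equal(p(x), marg(1, x[1]) * marg(2, x[2]) * marg(3, x[3])))))

c(pairwise = pairwise_ok, mutual = mutual_ok)
```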

Unless there is a possible misunderstanding between mutual and pairwise inde- 
pendence, we usually drop the modifier mutual. Accordingly, using this practice in 
Example 2.6.2, we say that X1, X2, X3 are independent random variables, meaning
that they are mutually independent. Occasionally, for emphasis, we use mutually 
independent so that the reader is reminded that this is different from pairwise in- 
dependence. 

In addition, if several random variables are mutually independent and have 
the same distribution, we say that they are independent and identically dis- 
tributed, which we abbreviate as iid. So the random variables in Example 2.6.2 
are iid with the common pdf given in expression (2.6.8). 


The following is a useful corollary to Theorem 2.6.1 for iid random variables. Its 
proof is asked for in Exercise 2.6.7. 


Corollary 2.6.1. Suppose $X_1, X_2, \ldots, X_n$ are iid random variables with the common
mgf $M(t)$, for $-h < t < h$, where $h > 0$. Let $T = \sum_{i=1}^{n} X_i$. Then $T$ has the
mgf given by

$$M_T(t) = [M(t)]^n, \quad -h < t < h. \tag{2.6.9}$$
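Theorem 2.6.1 and Corollary 2.6.1 can be checked exactly in a small discrete case. The R sketch below assumes independent Bernoulli variables, whose mgf $1 - p + pe^t$ is finite for every $t$; the success probabilities, the constants $k_i$, and the function names are hypothetical choices made for illustration.

```r
p <- c(0.2, 0.5, 0.7)                          # hypothetical Bernoulli success probabilities
k <- c(2, -1, 3)                               # hypothetical constants k_i
M_i <- function(t, p_i) 1 - p_i + p_i * exp(t) # Bernoulli mgf, finite for all t

# Right side of (2.6.7): product of M_i(k_i * t).
mgf_product <- function(t) prod(M_i(k * t, p))

# Left side: E[exp(t * T)] by enumerating all 2^3 outcomes of (X1, X2, X3).
outcomes <- as.matrix(expand.grid(0:1, 0:1, 0:1))
mgf_direct <- function(t) {
  probs <- apply(outcomes, 1, function(x) prod(ifelse(x == 1, p, 1 - p)))
  T_vals <- as.vector(outcomes %*% k)
  sum(probs * exp(t * T_vals))
}

# Corollary 2.6.1: a sum of n iid Bernoulli(q) variables is binomial(n, q).
mgf_binom <- function(t, n, q) sum(dbinom(0:n, n, q) * exp(t * (0:n)))

c(product = mgf_product(0.3), direct = mgf_direct(0.3),
  corollary = M_i(0.3, 0.4)^5, binomial = mgf_binom(0.3, 5, 0.4))
```

The first two entries agree, as do the last two, which illustrates (2.6.7) and (2.6.9) for this particular choice of distributions.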


2.6.1 *Multivariate Variance-Covariance Matrix 


This section makes explicit use of matrix algebra and it is considered as an optional 
section. 

In Section 2.5 we discussed the covariance between two random variables. In
this section we want to extend this discussion to the n-variate case. Let
$\mathbf{X} = (X_1, \ldots, X_n)'$ be an n-dimensional random vector. Recall that we defined
$E(\mathbf{X}) = (E(X_1), \ldots, E(X_n))'$, that is, the expectation of a random vector is just the vector
of the expectations of its components. Now suppose $\mathbf{W}$ is an $m \times n$ matrix of
random variables, say, $\mathbf{W} = [W_{ij}]$ for the random variables $W_{ij}$, $1 \le i \le m$ and
$1 \le j \le n$. Note that we can always string out the matrix into an $mn \times 1$ random
vector. Hence, we define the expectation of a random matrix

$$E[\mathbf{W}] = [E(W_{ij})]. \tag{2.6.10}$$


As the following theorem shows, the linearity of the expectation operator easily 
follows from this definition: 


Theorem 2.6.2. Let $\mathbf{W}_1$ and $\mathbf{W}_2$ be $m \times n$ matrices of random variables, let $\mathbf{A}_1$
and $\mathbf{A}_2$ be $k \times m$ matrices of constants, and let $\mathbf{B}$ be an $n \times l$ matrix of constants.
Then

$$E[\mathbf{A}_1\mathbf{W}_1 + \mathbf{A}_2\mathbf{W}_2] = \mathbf{A}_1E[\mathbf{W}_1] + \mathbf{A}_2E[\mathbf{W}_2] \tag{2.6.11}$$

$$E[\mathbf{A}_1\mathbf{W}_1\mathbf{B}] = \mathbf{A}_1E[\mathbf{W}_1]\mathbf{B}. \tag{2.6.12}$$

Proof: Because of the linearity of the operator $E$ on random variables, we have for
the $(i, j)$th components of expression (2.6.11) that

$$E\left[\sum_{s=1}^{m} a_{1is}W_{1sj} + \sum_{s=1}^{m} a_{2is}W_{2sj}\right] = \sum_{s=1}^{m} a_{1is}E[W_{1sj}] + \sum_{s=1}^{m} a_{2is}E[W_{2sj}].$$

Hence by (2.6.10), expression (2.6.11) is true. The derivation of expression (2.6.12)
follows in the same manner. ■


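Let $\mathbf{X} = (X_1, \ldots, X_n)'$ be an n-dimensional random vector, such that
$\sigma_i^2 = \mathrm{Var}(X_i) < \infty$. The mean of $\mathbf{X}$ is $\boldsymbol{\mu} = E[\mathbf{X}]$ and we define its variance-covariance
matrix as

$$\mathrm{Cov}(\mathbf{X}) = E[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})'] = [\sigma_{ij}], \tag{2.6.13}$$

where $\sigma_{ii}$ denotes $\sigma_i^2$. As Exercise 2.6.8 shows, the ith diagonal entry of $\mathrm{Cov}(\mathbf{X})$ is
$\sigma_i^2 = \mathrm{Var}(X_i)$ and the $(i, j)$th off diagonal entry is $\mathrm{Cov}(X_i, X_j)$.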

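Example 2.6.3 (Example 2.5.6, Continued). In Example 2.5.6, we considered
the joint pdf

$$f(x, y) = \begin{cases} e^{-y} & 0 < x < y < \infty \\ 0 & \text{elsewhere,} \end{cases}$$

and showed that the first two moments are

$$\mu_1 = 1,\quad \mu_2 = 2,\quad \sigma_1^2 = 1,\quad \sigma_2^2 = 2,\quad E[(X - \mu_1)(Y - \mu_2)] = 1. \tag{2.6.14}$$

Let $\mathbf{Z} = (X, Y)'$. Then using the present notation, we have

$$E[\mathbf{Z}] = \begin{bmatrix} 1 \\ 2 \end{bmatrix} \quad \text{and} \quad \mathrm{Cov}(\mathbf{Z}) = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix}. \ ■$$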
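Two properties of $\mathrm{Cov}(X_i, X_j)$ needed later are summarized in the following
theorem:

Theorem 2.6.3. Let $\mathbf{X} = (X_1, \ldots, X_n)'$ be an n-dimensional random vector, such
that $\sigma_i^2 = \sigma_{ii} = \mathrm{Var}(X_i) < \infty$. Let $\mathbf{A}$ be an $m \times n$ matrix of constants. Then

$$\mathrm{Cov}(\mathbf{X}) = E[\mathbf{X}\mathbf{X}'] - \boldsymbol{\mu}\boldsymbol{\mu}' \tag{2.6.15}$$

$$\mathrm{Cov}(\mathbf{A}\mathbf{X}) = \mathbf{A}\,\mathrm{Cov}(\mathbf{X})\,\mathbf{A}'. \tag{2.6.16}$$

Proof: Use Theorem 2.6.2 to derive (2.6.15); i.e.,

$$\begin{aligned}
\mathrm{Cov}(\mathbf{X}) &= E[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})'] \\
&= E[\mathbf{X}\mathbf{X}' - \boldsymbol{\mu}\mathbf{X}' - \mathbf{X}\boldsymbol{\mu}' + \boldsymbol{\mu}\boldsymbol{\mu}'] \\
&= E[\mathbf{X}\mathbf{X}'] - \boldsymbol{\mu}E[\mathbf{X}'] - E[\mathbf{X}]\boldsymbol{\mu}' + \boldsymbol{\mu}\boldsymbol{\mu}',
\end{aligned}$$

which is the desired result. The proof of (2.6.16) is left as an exercise. ■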


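All variance-covariance matrices are positive semi-definite matrices; that is,
$\mathbf{a}'\mathrm{Cov}(\mathbf{X})\mathbf{a} \ge 0$, for all vectors $\mathbf{a} \in R^n$. To see this let $\mathbf{X}$ be a random vector and
let $\mathbf{a}$ be any $n \times 1$ vector of constants. Then $Y = \mathbf{a}'\mathbf{X}$ is a random variable and,
hence, has nonnegative variance; i.e.,

$$0 \le \mathrm{Var}(Y) = \mathrm{Var}(\mathbf{a}'\mathbf{X}) = \mathbf{a}'\mathrm{Cov}(\mathbf{X})\mathbf{a}; \tag{2.6.17}$$

hence, $\mathrm{Cov}(\mathbf{X})$ is positive semi-definite.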
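Expressions (2.6.16) and (2.6.17) translate directly into matrix arithmetic. The R sketch below starts from the variance-covariance matrix of Example 2.6.3; the matrix `A` and the vector `a` are hypothetical constants chosen only to illustrate the formulas.

```r
Sigma <- matrix(c(1, 1,
                  1, 2), nrow = 2, byrow = TRUE)   # Cov(Z) from Example 2.6.3
A <- matrix(c(1, -1,
              2,  1,
              0,  3), nrow = 3, byrow = TRUE)      # hypothetical 3 x 2 matrix of constants

cov_AZ <- A %*% Sigma %*% t(A)                     # Cov(AZ) by (2.6.16)

# Positive semi-definiteness: eigenvalues are nonnegative up to rounding error.
eig <- eigen(cov_AZ, symmetric = TRUE)$values

# Quadratic form (2.6.17): a' Sigma a is the variance of a'Z = 3X - 2Y.
a <- c(3, -2)
quad <- as.numeric(t(a) %*% Sigma %*% a)

list(cov_AZ = cov_AZ, eigenvalues = eig, var_aZ = quad)
```

Because `A` has more rows than columns, `cov_AZ` is singular, so one eigenvalue is zero up to rounding; this is the semi-definite case rather than the definite one.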


EXERCISES 


2.6.1. Let $X, Y, Z$ have joint pdf $f(x, y, z) = 2(x + y + z)/3$, $0 < x < 1$, $0 < y < 1$,
$0 < z < 1$, zero elsewhere.

(a) Find the marginal probability density functions of $X$, $Y$, and $Z$.

(b) Compute $P(0 < X < \frac{1}{2}, 0 < Y < \frac{1}{2}, 0 < Z < \frac{1}{2})$ and
$P(0 < X < \frac{1}{2}) = P(0 < Y < \frac{1}{2}) = P(0 < Z < \frac{1}{2})$.

(c) Are $X$, $Y$, and $Z$ independent?

(d) Calculate $E(X^2YZ + 3XY^4Z^2)$.

(e) Determine the cdf of $X$, $Y$, and $Z$.

(f) Find the conditional distribution of $X$ and $Y$, given $Z = z$, and evaluate
$E(X + Y|z)$.

(g) Determine the conditional distribution of $X$, given $Y = y$ and $Z = z$, and
compute $E(X|y, z)$.


2.6.2. Let $f(x_1, x_2, x_3) = \exp[-(x_1 + x_2 + x_3)]$, $0 < x_1 < \infty$, $0 < x_2 < \infty$,
$0 < x_3 < \infty$, zero elsewhere, be the joint pdf of $X_1, X_2, X_3$.

(a) Compute $P(X_1 < X_2 < X_3)$ and $P(X_1 = X_2 < X_3)$.


(b) Determine the joint mgf of X1,X2, and X3. Are these random variables 
independent? 


2.6.3. Let $X_1, X_2, X_3$, and $X_4$ be four independent random variables, each with
pdf $f(x) = 3(1 - x)^2$, $0 < x < 1$, zero elsewhere. If $Y$ is the minimum of these four
variables, find the cdf and the pdf of $Y$.

Hint: $P(Y > y) = P(X_i > y,\ i = 1, \ldots, 4)$.

2.6.4. A fair die is cast at random three independent times. Let the random variable
$X_i$ be equal to the number of spots that appear on the ith trial, $i = 1, 2, 3$. Let the
random variable $Y$ be equal to $\max(X_i)$. Find the cdf and the pmf of $Y$.

Hint: $P(Y \le y) = P(X_i \le y,\ i = 1, 2, 3)$.


2.6.5. Let $M(t_1, t_2, t_3)$ be the mgf of the random variables $X_1$, $X_2$, and $X_3$ of
Bernstein's example, described in the remark following Example 2.6.2. Show that

$$M(t_1, t_2, 0) = M(t_1, 0, 0)M(0, t_2, 0), \quad M(t_1, 0, t_3) = M(t_1, 0, 0)M(0, 0, t_3),$$

and

$$M(0, t_2, t_3) = M(0, t_2, 0)M(0, 0, t_3)$$

are true, but that

$$M(t_1, t_2, t_3) \ne M(t_1, 0, 0)M(0, t_2, 0)M(0, 0, t_3).$$

Thus $X_1, X_2, X_3$ are pairwise independent but not mutually independent.


2.6.6. Let $X_1$, $X_2$, and $X_3$ be three random variables with means, variances, and
correlation coefficients, denoted by $\mu_1, \mu_2, \mu_3$; $\sigma_1^2, \sigma_2^2, \sigma_3^2$; and $\rho_{12}, \rho_{13}, \rho_{23}$, respectively.
For constants $b_2$ and $b_3$, suppose $E(X_1 - \mu_1|x_2, x_3) = b_2(x_2 - \mu_2) + b_3(x_3 - \mu_3)$.
Determine $b_2$ and $b_3$ in terms of the variances and the correlation coefficients.


2.6.7. Prove Corollary 2.6.1. 


2.6.8. Let $\mathbf{X} = (X_1, \ldots, X_n)'$ be an n-dimensional random vector, with the variance-covariance
matrix given in display (2.6.13). Show that the ith diagonal entry of
$\mathrm{Cov}(\mathbf{X})$ is $\sigma_i^2 = \mathrm{Var}(X_i)$ and that the $(i, j)$th off diagonal entry is $\mathrm{Cov}(X_i, X_j)$.


2.6.9. Let $X_1, X_2, X_3$ be iid with common pdf $f(x) = \exp(-x)$, $0 < x < \infty$, zero
elsewhere. Evaluate:

(a) $P(X_1 < X_2|X_1 < 2X_2)$.

(b) $P(X_1 < X_2 < X_3|X_3 < 1)$.


2.7 Transformations for Several Random Variables


In Section 2.2 it was seen that the determination of the joint pdf of two functions of
two random variables of the continuous type was essentially a corollary to a theorem
in analysis having to do with the change of variables in a twofold integral. This
theorem has a natural extension to n-fold integrals. This extension is as follows.
Consider an integral of the form

$$\int \cdots \int_A f(x_1, x_2, \ldots, x_n)\,dx_1 dx_2 \cdots dx_n$$

taken over a subset $A$ of an n-dimensional space $\mathcal{S}$. Let

$$y_1 = u_1(x_1, x_2, \ldots, x_n),\quad y_2 = u_2(x_1, x_2, \ldots, x_n),\ \ldots,\ y_n = u_n(x_1, x_2, \ldots, x_n),$$

together with the inverse functions

$$x_1 = w_1(y_1, y_2, \ldots, y_n),\quad x_2 = w_2(y_1, y_2, \ldots, y_n),\ \ldots,\ x_n = w_n(y_1, y_2, \ldots, y_n)$$

define a one-to-one transformation that maps $\mathcal{S}$ onto $\mathcal{T}$ in the $y_1, y_2, \ldots, y_n$ space
and, hence, maps the subset $A$ of $\mathcal{S}$ onto a subset $B$ of $\mathcal{T}$. Let the first partial
derivatives of the inverse functions be continuous and let the n by n determinant
(called the Jacobian)

$$J = \begin{vmatrix}
\dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_1}{\partial y_2} & \cdots & \dfrac{\partial x_1}{\partial y_n} \\
\dfrac{\partial x_2}{\partial y_1} & \dfrac{\partial x_2}{\partial y_2} & \cdots & \dfrac{\partial x_2}{\partial y_n} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial x_n}{\partial y_1} & \dfrac{\partial x_n}{\partial y_2} & \cdots & \dfrac{\partial x_n}{\partial y_n}
\end{vmatrix}$$

not be identically zero in $\mathcal{T}$. Then

$$\int \cdots \int_A f(x_1, x_2, \ldots, x_n)\,dx_1 dx_2 \cdots dx_n
= \int \cdots \int_B f[w_1(y_1, \ldots, y_n), w_2(y_1, \ldots, y_n), \ldots, w_n(y_1, \ldots, y_n)]\,|J|\,dy_1 dy_2 \cdots dy_n.$$

Whenever the conditions of this theorem are satisfied, we can determine the joint pdf
of n functions of n random variables. Appropriate changes of notation in Section
2.2 (to indicate n-space as opposed to 2-space) are all that are needed to show
that the joint pdf of the random variables $Y_1 = u_1(X_1, X_2, \ldots, X_n), \ldots,
Y_n = u_n(X_1, X_2, \ldots, X_n)$, where the joint pdf of $X_1, \ldots, X_n$ is $f(x_1, \ldots, x_n)$, is given by

$$g(y_1, y_2, \ldots, y_n) = f[w_1(y_1, \ldots, y_n), \ldots, w_n(y_1, \ldots, y_n)]\,|J|,$$

where $(y_1, y_2, \ldots, y_n) \in \mathcal{T}$, and is zero elsewhere.
Example 2.7.1. Let $X_1, X_2, X_3$ have the joint pdf

$$f(x_1, x_2, x_3) = \begin{cases} 48x_1x_2x_3 & 0 < x_1 < x_2 < x_3 < 1 \\ 0 & \text{elsewhere.} \end{cases} \tag{2.7.1}$$

If $Y_1 = X_1/X_2$, $Y_2 = X_2/X_3$, and $Y_3 = X_3$, then the inverse transformation is given
by

$$x_1 = y_1y_2y_3,\quad x_2 = y_2y_3,\quad \text{and}\quad x_3 = y_3.$$

The Jacobian is given by

$$J = \begin{vmatrix} y_2y_3 & y_1y_3 & y_1y_2 \\ 0 & y_3 & y_2 \\ 0 & 0 & 1 \end{vmatrix} = y_2y_3^2.$$

Moreover, inequalities defining the support are equivalent to

$$0 < y_1y_2y_3,\quad y_1y_2y_3 < y_2y_3,\quad y_2y_3 < y_3,\quad \text{and}\quad y_3 < 1,$$

which reduces to the support $\mathcal{T}$ of $Y_1, Y_2, Y_3$ of

$$\mathcal{T} = \{(y_1, y_2, y_3) : 0 < y_i < 1,\ i = 1, 2, 3\}.$$

Hence the joint pdf of $Y_1, Y_2, Y_3$ is

$$g(y_1, y_2, y_3) = 48(y_1y_2y_3)(y_2y_3)y_3\,|y_2y_3^2|
= \begin{cases} 48y_1y_2^3y_3^5 & 0 < y_i < 1,\ i = 1, 2, 3 \\ 0 & \text{elsewhere.} \end{cases} \tag{2.7.2}$$

The marginal pdfs are

$$\begin{aligned}
g_1(y_1) &= 2y_1,\ 0 < y_1 < 1, \text{ zero elsewhere} \\
g_2(y_2) &= 4y_2^3,\ 0 < y_2 < 1, \text{ zero elsewhere} \\
g_3(y_3) &= 6y_3^5,\ 0 < y_3 < 1, \text{ zero elsewhere.}
\end{aligned}$$

Because $g(y_1, y_2, y_3) = g_1(y_1)g_2(y_2)g_3(y_3)$, the random variables $Y_1, Y_2, Y_3$ are
mutually independent. ■
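The change-of-variable computation in Example 2.7.1 can be spot-checked at a single point. The R sketch below evaluates the Jacobian determinant numerically with `det()`, forms $f[w_1, w_2, w_3]\,|J|$, and compares it with the closed form (2.7.2) and with the product of the three marginal pdfs; the evaluation point and function names are arbitrary.

```r
# Jacobian of the inverse transformation x1 = y1*y2*y3, x2 = y2*y3, x3 = y3.
jacobian <- function(y) {
  det(matrix(c(y[2] * y[3], y[1] * y[3], y[1] * y[2],
               0,           y[3],        y[2],
               0,           0,           1), nrow = 3, byrow = TRUE))
}

f <- function(x) 48 * prod(x)          # joint pdf (2.7.1), valid on its support only
g <- function(y) f(c(prod(y), y[2] * y[3], y[3])) * abs(jacobian(y))

y <- c(0.3, 0.6, 0.8)                  # arbitrary point in the support of (Y1, Y2, Y3)
c(change_of_variable = g(y),
  closed_form = 48 * y[1] * y[2]^3 * y[3]^5,
  product_of_marginals = (2 * y[1]) * (4 * y[2]^3) * (6 * y[3]^5))
```

Agreement at one point is only a sanity check; the factorization argument in the example is what establishes independence.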


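Example 2.7.2. Let $X_1, X_2, X_3$ be iid with common pdf

$$f(x) = \begin{cases} e^{-x} & 0 < x < \infty \\ 0 & \text{elsewhere.} \end{cases}$$

Consequently, the joint pdf of $X_1, X_2, X_3$ is

$$f_{X_1, X_2, X_3}(x_1, x_2, x_3) = \begin{cases} e^{-\sum_{i=1}^{3} x_i} & 0 < x_i < \infty,\ i = 1, 2, 3 \\ 0 & \text{elsewhere.} \end{cases}$$

Consider the random variables $Y_1, Y_2, Y_3$ defined by

$$Y_1 = \frac{X_1}{X_1 + X_2 + X_3},\quad Y_2 = \frac{X_2}{X_1 + X_2 + X_3},\quad \text{and}\quad Y_3 = X_1 + X_2 + X_3.$$

Hence, the inverse transformation is given by

$$x_1 = y_1y_3,\quad x_2 = y_2y_3,\quad \text{and}\quad x_3 = y_3 - y_1y_3 - y_2y_3,$$

with the Jacobian

$$J = \begin{vmatrix} y_3 & 0 & y_1 \\ 0 & y_3 & y_2 \\ -y_3 & -y_3 & 1 - y_1 - y_2 \end{vmatrix} = y_3^2.$$

The support of $X_1, X_2, X_3$ maps onto

$$0 < y_1y_3 < \infty,\quad 0 < y_2y_3 < \infty,\quad \text{and}\quad 0 < y_3(1 - y_1 - y_2) < \infty,$$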
which is equivalent to the support $\mathcal{T}$ given by

$$\mathcal{T} = \{(y_1, y_2, y_3) : 0 < y_1,\ 0 < y_2,\ 0 < 1 - y_1 - y_2,\ 0 < y_3 < \infty\}.$$

Hence the joint pdf of $Y_1, Y_2, Y_3$ is

$$g(y_1, y_2, y_3) = y_3^2e^{-y_3}, \quad (y_1, y_2, y_3) \in \mathcal{T}.$$

The marginal pdf of $Y_1$ is

$$g_1(y_1) = \int_0^{1 - y_1}\int_0^{\infty} y_3^2e^{-y_3}\,dy_3\,dy_2 = 2(1 - y_1), \quad 0 < y_1 < 1,$$

zero elsewhere. Likewise the marginal pdf of $Y_2$ is

$$g_2(y_2) = 2(1 - y_2), \quad 0 < y_2 < 1,$$

zero elsewhere, while the pdf of $Y_3$ is

$$g_3(y_3) = \int_0^1\int_0^{1 - y_1} y_3^2e^{-y_3}\,dy_2\,dy_1 = \frac{1}{2}y_3^2e^{-y_3}, \quad 0 < y_3 < \infty,$$

zero elsewhere. Because $g(y_1, y_2, y_3) \ne g_1(y_1)g_2(y_2)g_3(y_3)$, $Y_1, Y_2, Y_3$ are dependent
random variables.

Note, however, that the joint pdf of $Y_1$ and $Y_3$ is

$$g_{13}(y_1, y_3) = \int_0^{1 - y_1} y_3^2e^{-y_3}\,dy_2 = (1 - y_1)y_3^2e^{-y_3}, \quad 0 < y_1 < 1,\ 0 < y_3 < \infty,$$

zero elsewhere. Hence $Y_1$ and $Y_3$ are independent. In a similar manner, $Y_2$ and $Y_3$
are also independent. Because the joint pdf of $Y_1$ and $Y_2$ is

$$g_{12}(y_1, y_2) = \int_0^{\infty} y_3^2e^{-y_3}\,dy_3 = 2, \quad 0 < y_1,\ 0 < y_2,\ y_1 + y_2 < 1,$$

zero elsewhere, $Y_1$ and $Y_2$ are seen to be dependent. ■
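The dependence structure found in Example 2.7.2 shows up clearly in simulated data. The R sketch below draws iid exponential variables with `rexp()`, forms $Y_1$, $Y_2$, $Y_3$, and reports sample correlations and means; the seed and sample size are arbitrary, and a near-zero correlation is only consistent with independence, not a proof of it.

```r
set.seed(3)                              # arbitrary seed, for reproducibility
n <- 1e5
x <- matrix(rexp(3 * n), ncol = 3)       # iid draws with pdf e^{-x}, one row per triple
y3 <- rowSums(x)
y1 <- x[, 1] / y3
y2 <- x[, 2] / y3

c(cor_y1_y3 = cor(y1, y3),               # near 0: Y1 and Y3 are independent
  cor_y1_y2 = cor(y1, y2),               # clearly negative: Y1 and Y2 are dependent
  mean_y1 = mean(y1),                    # the pdf 2(1 - y1) has mean 1/3
  mean_y3 = mean(y3))                    # the pdf y3^2 e^{-y3} / 2 has mean 3
```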


We now consider some other problems that are encountered when transforming
variables. Let $X$ have the Cauchy pdf

$$f(x) = \frac{1}{\pi(1 + x^2)}, \quad -\infty < x < \infty,$$

and let $Y = X^2$. We seek the pdf $g(y)$ of $Y$. Consider the transformation $y = x^2$.
This transformation maps the space of $X$, namely $\mathcal{S} = \{x : -\infty < x < \infty\}$, onto
$\mathcal{T} = \{y : 0 \le y < \infty\}$. However, the transformation is not one-to-one. To each
$y \in \mathcal{T}$, with the exception of $y = 0$, there correspond two points $x \in \mathcal{S}$. For
example, if $y = 4$, we may have either $x = 2$ or $x = -2$. In such an instance, we
represent $\mathcal{S}$ as the union of two disjoint sets $A_1$ and $A_2$ such that $y = x^2$ defines
a one-to-one transformation that maps each of $A_1$ and $A_2$ onto $\mathcal{T}$. If we take $A_1$
to be $\{x : -\infty < x < 0\}$ and $A_2$ to be $\{x : 0 \le x < \infty\}$, we see that $A_1$ is
mapped onto $\{y : 0 < y < \infty\}$, whereas $A_2$ is mapped onto $\{y : 0 \le y < \infty\}$,
and these sets are not the same. Our difficulty is caused by the fact that $x = 0$
is an element of $\mathcal{S}$. Why, then, do we not return to the Cauchy pdf and take
$f(0) = 0$? Then our new $\mathcal{S}$ is $\mathcal{S} = \{-\infty < x < \infty \text{ but } x \ne 0\}$. We then take
$A_1 = \{x : -\infty < x < 0\}$ and $A_2 = \{x : 0 < x < \infty\}$. Thus $y = x^2$, with the
inverse $x = -\sqrt{y}$, maps $A_1$ onto $\mathcal{T} = \{y : 0 < y < \infty\}$ and the transformation is
one-to-one. Moreover, the transformation $y = x^2$, with inverse $x = \sqrt{y}$, maps $A_2$
onto $\mathcal{T} = \{y : 0 < y < \infty\}$ and the transformation is one-to-one. Consider the
probability $P(Y \in B)$, where $B \subset \mathcal{T}$. Let $A_3 = \{x : x = -\sqrt{y},\ y \in B\} \subset A_1$ and
let $A_4 = \{x : x = \sqrt{y},\ y \in B\} \subset A_2$. Then $Y \in B$ when and only when $X \in A_3$ or
$X \in A_4$. Thus we have


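$$P(Y \in B) = P(X \in A_3) + P(X \in A_4) = \int_{A_3} f(x)\,dx + \int_{A_4} f(x)\,dx.$$

In the first of these integrals, let $x = -\sqrt{y}$. Thus the Jacobian, say $J_1$, is $-1/(2\sqrt{y})$;
furthermore, the set $A_3$ is mapped onto $B$. In the second integral let $x = \sqrt{y}$. Thus
the Jacobian, say $J_2$, is $1/(2\sqrt{y})$; furthermore, the set $A_4$ is also mapped onto $B$.
Finally,

$$\begin{aligned}
P(Y \in B) &= \int_B f(-\sqrt{y})\left|-\frac{1}{2\sqrt{y}}\right| dy + \int_B f(\sqrt{y})\frac{1}{2\sqrt{y}}\,dy \\
&= \int_B \left[f(-\sqrt{y}) + f(\sqrt{y})\right]\frac{1}{2\sqrt{y}}\,dy.
\end{aligned}$$

Hence the pdf of $Y$ is given by

$$g(y) = \frac{1}{2\sqrt{y}}\left[f(-\sqrt{y}) + f(\sqrt{y})\right], \quad y \in \mathcal{T}.$$

With $f(x)$ the Cauchy pdf we have

$$g(y) = \begin{cases} \dfrac{1}{\pi(1 + y)\sqrt{y}} & 0 < y < \infty \\ 0 & \text{elsewhere.} \end{cases}$$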

In the preceding discussion of a random variable of the continuous type, we had
two inverse functions, $x = -\sqrt{y}$ and $x = \sqrt{y}$. That is why we sought to partition
$\mathcal{S}$ (or a modification of $\mathcal{S}$) into two disjoint subsets such that the transformation
$y = x^2$ maps each onto the same $\mathcal{T}$. Had there been three inverse functions, we
would have sought to partition $\mathcal{S}$ (or a modified form of $\mathcal{S}$) into three disjoint
subsets, and so on. It is hoped that this detailed discussion makes the following
paragraph easier to read.

Let $f(x_1, x_2, \ldots, x_n)$ be the joint pdf of $X_1, X_2, \ldots, X_n$, which are random variables
of the continuous type. Let $\mathcal{S}$ denote the n-dimensional space where this joint
pdf $f(x_1, x_2, \ldots, x_n) > 0$, and consider the transformation $y_1 = u_1(x_1, x_2, \ldots, x_n),
\ldots, y_n = u_n(x_1, x_2, \ldots, x_n)$, which maps $\mathcal{S}$ onto $\mathcal{T}$ in the $y_1, y_2, \ldots, y_n$ space. To
each point of $\mathcal{S}$ there corresponds, of course, only one point in $\mathcal{T}$; but to a point
in $\mathcal{T}$ there may correspond more than one point in $\mathcal{S}$. That is, the transformation
may not be one-to-one. Suppose, however, that we can represent $\mathcal{S}$ as the union of
a finite number, say $k$, of mutually disjoint sets $A_1, A_2, \ldots, A_k$ so that

$$y_1 = u_1(x_1, x_2, \ldots, x_n),\ \ldots,\ y_n = u_n(x_1, x_2, \ldots, x_n)$$

define a one-to-one transformation of each $A_i$ onto $\mathcal{T}$. Thus to each point in $\mathcal{T}$
there corresponds exactly one point in each of $A_1, A_2, \ldots, A_k$. For $i = 1, \ldots, k$, let

$$x_1 = w_{1i}(y_1, y_2, \ldots, y_n),\quad x_2 = w_{2i}(y_1, y_2, \ldots, y_n),\ \ldots,\ x_n = w_{ni}(y_1, y_2, \ldots, y_n)$$

denote the $k$ groups of n inverse functions, one group for each of these $k$ transformations.
Let the first partial derivatives be continuous and let each

$$J_i = \begin{vmatrix}
\dfrac{\partial w_{1i}}{\partial y_1} & \dfrac{\partial w_{1i}}{\partial y_2} & \cdots & \dfrac{\partial w_{1i}}{\partial y_n} \\
\dfrac{\partial w_{2i}}{\partial y_1} & \dfrac{\partial w_{2i}}{\partial y_2} & \cdots & \dfrac{\partial w_{2i}}{\partial y_n} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial w_{ni}}{\partial y_1} & \dfrac{\partial w_{ni}}{\partial y_2} & \cdots & \dfrac{\partial w_{ni}}{\partial y_n}
\end{vmatrix}, \quad i = 1, 2, \ldots, k,$$

be not identically equal to zero in $\mathcal{T}$. Considering the probability of the union
of $k$ mutually exclusive events and by applying the change-of-variable technique
to the probability of each of these events, it can be seen that the joint pdf of
$Y_1 = u_1(X_1, X_2, \ldots, X_n)$, $Y_2 = u_2(X_1, X_2, \ldots, X_n)$, $\ldots$, $Y_n = u_n(X_1, X_2, \ldots, X_n)$
is given by

$$g(y_1, y_2, \ldots, y_n) = \sum_{i=1}^{k} f[w_{1i}(y_1, \ldots, y_n), \ldots, w_{ni}(y_1, \ldots, y_n)]\,|J_i|,$$

provided that $(y_1, y_2, \ldots, y_n) \in \mathcal{T}$, and equals zero elsewhere. The pdf of any $Y_i$,
say $Y_1$, is then

$$g_1(y_1) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(y_1, y_2, \ldots, y_n)\,dy_2 \cdots dy_n.$$


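Example 2.7.3. Let $X_1$ and $X_2$ have the joint pdf defined over the unit circle
given by

$$f(x_1, x_2) = \begin{cases} \dfrac{1}{\pi} & 0 < x_1^2 + x_2^2 < 1 \\ 0 & \text{elsewhere.} \end{cases}$$

Let $Y_1 = X_1^2 + X_2^2$ and $Y_2 = X_1^2/(X_1^2 + X_2^2)$. Thus $y_1 y_2 = x_1^2$ and $x_2^2 = y_1(1 - y_2)$.
The support $S$ maps onto $T = \{(y_1, y_2) : 0 < y_i < 1, \ i = 1, 2\}$. For each ordered
pair $(y_1, y_2) \in T$, there are four points in $S$, given by

$(x_1, x_2)$ such that $x_1 = \sqrt{y_1 y_2}$ and $x_2 = \sqrt{y_1(1 - y_2)}$

$(x_1, x_2)$ such that $x_1 = \sqrt{y_1 y_2}$ and $x_2 = -\sqrt{y_1(1 - y_2)}$

$(x_1, x_2)$ such that $x_1 = -\sqrt{y_1 y_2}$ and $x_2 = \sqrt{y_1(1 - y_2)}$

and $(x_1, x_2)$ such that $x_1 = -\sqrt{y_1 y_2}$ and $x_2 = -\sqrt{y_1(1 - y_2)}$.

The value of the first Jacobian is

$$J_1 = \begin{vmatrix}
\frac{1}{2}\sqrt{y_2/y_1} & \frac{1}{2}\sqrt{y_1/y_2} \\
\frac{1}{2}\sqrt{(1 - y_2)/y_1} & -\frac{1}{2}\sqrt{y_1/(1 - y_2)}
\end{vmatrix}
= \frac{1}{4}\left\{ -\sqrt{\frac{1 - y_2}{y_2}} - \sqrt{\frac{y_2}{1 - y_2}} \right\}
= -\frac{1}{4} \frac{1}{\sqrt{y_2(1 - y_2)}}.$$

It is easy to see that the absolute value of each of the four Jacobians equals
$1/[4\sqrt{y_2(1 - y_2)}]$. Hence, the joint pdf of $Y_1$ and $Y_2$ is the sum of four terms and can
be written as

$$g(y_1, y_2) = 4 \cdot \frac{1}{\pi} \cdot \frac{1}{4\sqrt{y_2(1 - y_2)}} = \frac{1}{\pi \sqrt{y_2(1 - y_2)}}, \quad (y_1, y_2) \in T.$$

Thus $Y_1$ and $Y_2$ are independent random variables by Theorem 2.4.1. ■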
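The closed form of the first Jacobian in Example 2.7.3 can be checked against a finite-difference approximation. The R sketch below is only an illustration: the helper names inverse_1 and numeric_jacobian and the evaluation point y0 are our own choices, not notation from the text.

```r
# First group of inverse functions from Example 2.7.3:
# x1 = sqrt(y1 * y2), x2 = sqrt(y1 * (1 - y2))
inverse_1 <- function(y) c(sqrt(y[1] * y[2]), sqrt(y[1] * (1 - y[2])))

# Central-difference approximation to the Jacobian determinant of `fun` at y
numeric_jacobian <- function(fun, y, h = 1e-6) {
  J <- matrix(0, nrow = 2, ncol = 2)
  for (j in 1:2) {
    step <- c(0, 0)
    step[j] <- h
    J[, j] <- (fun(y + step) - fun(y - step)) / (2 * h)
  }
  det(J)
}

y0 <- c(0.3, 0.6)   # arbitrary interior point of T (our choice)
jac_numeric <- numeric_jacobian(inverse_1, y0)
jac_formula <- -1 / (4 * sqrt(y0[2] * (1 - y0[2])))
c(jac_numeric, jac_formula)
```

The two values should agree to several decimal places; changing the signs in inverse_1 gives the other three groups of inverse functions, whose Jacobians have the same absolute value.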


Of course, as in the bivariate case, we can use the mgf technique by noting that
if $Y = g(X_1, X_2, \ldots, X_n)$ is a function of the random variables, then the mgf of $Y$
is given by

$$E(e^{tY}) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} e^{t g(x_1, x_2, \ldots, x_n)} f(x_1, x_2, \ldots, x_n) \, dx_1 \, dx_2 \cdots dx_n,$$

in the continuous case, where $f(x_1, x_2, \ldots, x_n)$ is the joint pdf. In the discrete case,
summations replace the integrals. This procedure is particularly useful in cases in
which we are dealing with linear functions of independent random variables.


Example 2.7.4 (Extension of Example 2.2.6). Let $X_1, X_2, X_3$ be independent random
variables with joint pmf

$$p(x_1, x_2, x_3) = \begin{cases} \dfrac{\mu_1^{x_1} \mu_2^{x_2} \mu_3^{x_3} e^{-\mu_1 - \mu_2 - \mu_3}}{x_1! \, x_2! \, x_3!} & x_i = 0, 1, 2, \ldots, \ i = 1, 2, 3 \\ 0 & \text{elsewhere.} \end{cases}$$

If $Y = X_1 + X_2 + X_3$, the mgf of $Y$ is

$$E(e^{tY}) = E(e^{t(X_1 + X_2 + X_3)}) = E(e^{tX_1} e^{tX_2} e^{tX_3}) = E(e^{tX_1}) E(e^{tX_2}) E(e^{tX_3})$$

because of the independence of $X_1, X_2, X_3$. In Example 2.2.6, we found that

$$E(e^{tX_i}) = \exp\{\mu_i(e^t - 1)\}, \quad i = 1, 2, 3.$$

Hence,

$$E(e^{tY}) = \exp\{(\mu_1 + \mu_2 + \mu_3)(e^t - 1)\}.$$

This, however, is the mgf of the pmf

$$p_Y(y) = \begin{cases} \dfrac{(\mu_1 + \mu_2 + \mu_3)^y e^{-(\mu_1 + \mu_2 + \mu_3)}}{y!} & y = 0, 1, 2, \ldots \\ 0 & \text{elsewhere,} \end{cases}$$

so $Y = X_1 + X_2 + X_3$ has this distribution. ■
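The conclusion of Example 2.7.4 can be checked directly in R without mgfs. The sketch below sums the joint pmf over all triples with x1 + x2 + x3 = y and compares the result with dpois evaluated at the summed mean. The means in mu and the helper name conv_pmf are illustrative assumptions, not values from the text.

```r
mu <- c(0.5, 1.2, 2.0)   # hypothetical means mu_1, mu_2, mu_3

# P(X1 + X2 + X3 = y) by summing the joint pmf over all triples adding to y
conv_pmf <- function(y, mu) {
  total <- 0
  for (x1 in 0:y) {
    for (x2 in 0:(y - x1)) {
      x3 <- y - x1 - x2
      total <- total + dpois(x1, mu[1]) * dpois(x2, mu[2]) * dpois(x3, mu[3])
    }
  }
  total
}

ys <- 0:10
by_convolution <- sapply(ys, conv_pmf, mu = mu)
by_formula <- dpois(ys, sum(mu))   # pmf identified through the mgf
max(abs(by_convolution - by_formula))
```

The maximum absolute difference should be at the level of floating-point rounding error.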
Example 2.7.5. Let $X_1, X_2, X_3, X_4$ be independent random variables with common pdf

$$f(x) = \begin{cases} e^{-x} & x > 0 \\ 0 & \text{elsewhere.} \end{cases}$$

If $Y = X_1 + X_2 + X_3 + X_4$, then similar to the argument in the last example, the
independence of $X_1, X_2, X_3, X_4$ implies that

$$E(e^{tY}) = E(e^{tX_1}) E(e^{tX_2}) E(e^{tX_3}) E(e^{tX_4}).$$

In Section 1.9, we saw that

$$E(e^{tX_i}) = (1 - t)^{-1}, \quad t < 1, \ i = 1, 2, 3, 4.$$

Hence,

$$E(e^{tY}) = (1 - t)^{-4}.$$

In Section 3.3, we find that this is the mgf of a distribution with pdf

$$f_Y(y) = \begin{cases} \dfrac{1}{3!} y^3 e^{-y} & 0 < y < \infty \\ 0 & \text{elsewhere.} \end{cases}$$

Accordingly, $Y$ has this distribution. ■
EXERCISES 


2.7.1. Let $X_1, X_2, X_3$ be iid, each with the distribution having pdf $f(x) = e^{-x}$, $0 < x < \infty$,
zero elsewhere. Show that

$$Y_1 = \frac{X_1}{X_1 + X_2}, \quad Y_2 = \frac{X_1 + X_2}{X_1 + X_2 + X_3}, \quad Y_3 = X_1 + X_2 + X_3$$

are mutually independent.


2.7.2. If $f(x) = \frac{1}{2}$, $-1 < x < 1$, zero elsewhere, is the pdf of the random variable
$X$, find the pdf of $Y = X^2$.

2.7.3. If $X$ has the pdf $f(x) = \frac{1}{4}$, $-1 < x < 3$, zero elsewhere, find the pdf of
$Y = X^2$.

Hint: Here $T = \{y : 0 \le y < 9\}$ and the event $Y \in B$ is the union of two mutually
exclusive events if $B = \{y : 0 < y < 1\}$.


2.7.4. Let $X_1, X_2, X_3$ be iid with common pdf $f(x) = e^{-x}$, $x > 0$, 0 elsewhere.
Find the joint pdf of $Y_1 = X_1$, $Y_2 = X_1 + X_2$, and $Y_3 = X_1 + X_2 + X_3$.

2.7.5. Let $X_1, X_2, X_3$ be iid with common pdf $f(x) = e^{-x}$, $x > 0$, 0 elsewhere.
Find the joint pdf of $Y_1 = X_1/X_2$, $Y_2 = X_3/(X_1 + X_2)$, and $Y_3 = X_1 + X_2$. Are
$Y_1, Y_2, Y_3$ mutually independent?


2.7.6. Let $X_1, X_2$ have the joint pdf $f(x_1, x_2) = 1/\pi$, $0 < x_1^2 + x_2^2 < 1$. Let
$Y_1 = X_1^2 + X_2^2$ and $Y_2 = X_2$. Find the joint pdf of $Y_1$ and $Y_2$.


2.7.7. Let $X_1, X_2, X_3, X_4$ have the joint pdf $f(x_1, x_2, x_3, x_4) = 24$, $0 < x_1 < x_2 <
x_3 < x_4 < 1$, 0 elsewhere. Find the joint pdf of $Y_1 = X_1/X_2$, $Y_2 = X_2/X_3$, $Y_3 =
X_3/X_4$, $Y_4 = X_4$ and show that they are mutually independent.


2.7.8. Let $X_1, X_2, X_3$ be iid with common mgf $M(t) = ((3/4) + (1/4)e^t)^2$, for all
$t \in R$.


(a) Determine the probabilities $P(X_1 = k)$, $k = 0, 1, 2$.


(b) Find the mgf of $Y = X_1 + X_2 + X_3$ and then determine the probabilities
$P(Y = k)$, $k = 0, 1, 2, \ldots, 6$.


2.8 Linear Combinations of Random Variables 


In this section, we summarize some results on linear combinations of random vari- 
ables that follow from Section 2.6. These results will prove to be quite useful in 
Chapter 3 as well as in succeeding chapters. 

Let $(X_1, \ldots, X_n)'$ denote a random vector. In this section, we consider linear
combinations of these variables, writing them, generally, as

$$T = \sum_{i=1}^{n} a_i X_i, \tag{2.8.1}$$

for specified constants $a_1, \ldots, a_n$. We obtain expressions for the mean and variance
of $T$.

The mean of $T$ follows immediately from linearity of expectation. For reference,
we state it formally as a theorem.


Theorem 2.8.1. Suppose $T$ is given by expression (2.8.1). Suppose $E(X_i) = \mu_i$,
for $i = 1, \ldots, n$. Then

$$E(T) = \sum_{i=1}^{n} a_i \mu_i. \tag{2.8.2}$$


In order to obtain the variance of T’, we first state a general result on covariances. 


Theorem 2.8.2. Suppose $T$ is the linear combination (2.8.1) and that $W$ is another
linear combination given by $W = \sum_{j=1}^{m} b_j Y_j$, for random variables $Y_1, \ldots, Y_m$ and
specified constants $b_1, \ldots, b_m$. If $E[X_i^2] < \infty$ and $E[Y_j^2] < \infty$ for $i = 1, \ldots, n$
and $j = 1, \ldots, m$, then

$$\text{Cov}(T, W) = \sum_{i=1}^{n} \sum_{j=1}^{m} a_i b_j \, \text{Cov}(X_i, Y_j). \tag{2.8.3}$$
Proof: Using the definition of the covariance and Theorem 2.8.1, we have the first
equality below, while the second equality follows from the linearity of $E$:

$$\text{Cov}(T, W) = E\left[ \sum_{i=1}^{n} \sum_{j=1}^{m} (a_i X_i - a_i E(X_i))(b_j Y_j - b_j E(Y_j)) \right]
= \sum_{i=1}^{n} \sum_{j=1}^{m} a_i b_j E[(X_i - E(X_i))(Y_j - E(Y_j))],$$

which is the desired result. ■


To obtain the variance of T, simply replace W by T in expression (2.8.3). We 
state the result as a corollary: 


Corollary 2.8.1. Let $T = \sum_{i=1}^{n} a_i X_i$. Provided $E[X_i^2] < \infty$, for $i = 1, \ldots, n$,

$$\text{Var}(T) = \text{Cov}(T, T) = \sum_{i=1}^{n} a_i^2 \, \text{Var}(X_i) + 2 \sum_{i<j} a_i a_j \, \text{Cov}(X_i, X_j). \tag{2.8.4}$$

Note that if $X_1, \ldots, X_n$ are independent random variables, then by Theorem
2.5.2 all the pairwise covariances are 0; i.e., $\text{Cov}(X_i, X_j) = 0$ for all $i \ne j$. This
leads to a simplification of (2.8.4), which we record in the following corollary.


Corollary 2.8.2. If $X_1, \ldots, X_n$ are independent random variables and $\text{Var}(X_i) =
\sigma_i^2$, for $i = 1, \ldots, n$, then

$$\text{Var}(T) = \sum_{i=1}^{n} a_i^2 \sigma_i^2. \tag{2.8.5}$$

Note that we need only $X_i$ and $X_j$ to be uncorrelated for all $i \ne j$ to obtain this
result.
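Expression (2.8.4) is easy to evaluate numerically once the variances and covariances are collected in a matrix. In the R sketch below, the coefficient vector a and the covariance matrix Sigma are hypothetical values chosen for illustration. The matrix form a' Sigma a is a restatement of (2.8.3) with W = T, not notation used in this section.

```r
a <- c(2, -1, 0.5)                        # hypothetical constants a_1, a_2, a_3
Sigma <- matrix(c(4.0,  1.0,  0.5,
                  1.0,  9.0, -2.0,
                  0.5, -2.0,  1.0), nrow = 3, byrow = TRUE)  # Cov(X_i, X_j)

# Var(T) as the double sum in (2.8.3) with W = T, written in matrix form
var_quadratic <- as.numeric(t(a) %*% Sigma %*% a)

# Var(T) from (2.8.4): variance terms plus twice the covariances for i < j
var_284 <- sum(a^2 * diag(Sigma))
for (i in 1:2) {
  for (j in (i + 1):3) {
    var_284 <- var_284 + 2 * a[i] * a[j] * Sigma[i, j]
  }
}

# Corollary 2.8.2: what the variance would be if the X_i were uncorrelated
var_uncorrelated <- sum(a^2 * diag(Sigma))

c(var_quadratic, var_284, var_uncorrelated)
```

The first two values coincide, while the third differs because the off-diagonal covariances in Sigma are not zero.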
Next, in addition to independence, we assume that the random variables have
the same distribution. We call such a collection of random variables a random
sample, as stated in the following formal definition.


Definition 2.8.1. If the random variables $X_1, X_2, \ldots, X_n$ are independent and
identically distributed, i.e., each $X_i$ has the same distribution, then we say that
these random variables constitute a random sample of size $n$ from that common
distribution. We abbreviate independent and identically distributed by iid. ■


In the next two examples, we find some properties of two functions of a random 
sample, namely the sample mean and variance. 


Example 2.8.1 (Sample Mean). Let $X_1, \ldots, X_n$ be independent and identically
distributed random variables with common mean $\mu$ and variance $\sigma^2$. The sample
mean is defined by $\overline{X} = n^{-1} \sum_{i=1}^{n} X_i$. This is a linear combination of the sample
observations with $a_i \equiv n^{-1}$; hence, by Theorem 2.8.1 and Corollary 2.8.2, we have

$$E(\overline{X}) = \mu \quad \text{and} \quad \text{Var}(\overline{X}) = \frac{\sigma^2}{n}. \tag{2.8.6}$$

Because $E(\overline{X}) = \mu$, we often say that $\overline{X}$ is unbiased for $\mu$.
Example 2.8.2 (Sample Variance). Define the sample variance by

$$S^2 = (n - 1)^{-1} \sum_{i=1}^{n} (X_i - \overline{X})^2 = (n - 1)^{-1} \left( \sum_{i=1}^{n} X_i^2 - n \overline{X}^2 \right), \tag{2.8.7}$$

where the second equality follows after some algebra; see Exercise 2.8.1.

In the average that defines the sample variance $S^2$, the division is by $n - 1$
instead of $n$. One reason for this is that it makes $S^2$ unbiased for $\sigma^2$, as next
shown. Using the above theorems, the results of the last example, and the facts
that $E(X_i^2) = \sigma^2 + \mu^2$ and $E(\overline{X}^2) = (\sigma^2/n) + \mu^2$, we have the following:

$$E(S^2) = (n - 1)^{-1} \left( \sum_{i=1}^{n} E(X_i^2) - n E(\overline{X}^2) \right)
= (n - 1)^{-1} \{ n\sigma^2 + n\mu^2 - n[(\sigma^2/n) + \mu^2] \}
= \sigma^2. \tag{2.8.8}$$

Hence, $S^2$ is unbiased for $\sigma^2$. ■
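Both claims in Example 2.8.2 can be illustrated with a few lines of R. The sketch below first checks the two forms in (2.8.7) against R's var function on a small made-up data vector. It then verifies (2.8.8) exactly for n = 2 by enumerating every equally likely sample from a three-point population. The data and the population values are hypothetical.

```r
# Part 1: the two expressions in (2.8.7) agree with R's var()
x <- c(2.1, 3.4, 1.9, 5.0, 4.2)          # illustrative data, not from the text
n <- length(x)
s2_definition <- sum((x - mean(x))^2) / (n - 1)
s2_shortcut <- (sum(x^2) - n * mean(x)^2) / (n - 1)
s2_builtin <- var(x)

# Part 2: exact check of E(S^2) = sigma^2 for n = 2 iid draws
vals <- c(0, 1, 3)                        # equally likely values (hypothetical)
sigma2 <- mean(vals^2) - mean(vals)^2     # population variance
samples <- expand.grid(x1 = vals, x2 = vals)   # all 9 equally likely samples
expected_s2 <- mean(apply(samples, 1, var))    # average of S^2 over samples

c(s2_definition, s2_shortcut, s2_builtin)
c(expected_s2, sigma2)
```

Dividing by n instead of n - 1 in the second part would give an average of sigma2/2 for n = 2, which is the bias that the n - 1 divisor removes.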


EXERCISES 
2.8.1. Derive the second equality in expression (2.8.7). 


2.8.2. Let $X_1, X_2, X_3, X_4$ be four iid random variables having the same pdf $f(x) =
2x$, $0 < x < 1$, zero elsewhere. Find the mean and variance of the sum $Y$ of these
four random variables.


2.8.3. Let $X_1$ and $X_2$ be two independent random variables so that the variances
of $X_1$ and $X_2$ are $\sigma_1^2 = k$ and $\sigma_2^2 = 2$, respectively. Given that the variance of
$Y = 3X_2 - X_1$ is 25, find $k$.


2.8.4. If the independent variables $X_1$ and $X_2$ have means $\mu_1, \mu_2$ and variances
$\sigma_1^2, \sigma_2^2$, respectively, show that the mean and variance of the product $Y = X_1 X_2$
are $\mu_1 \mu_2$ and $\sigma_1^2 \sigma_2^2 + \mu_1^2 \sigma_2^2 + \mu_2^2 \sigma_1^2$, respectively.


2.8.5. Find the mean and variance of the sum $Y = \sum_{i=1}^{5} X_i$, where $X_1, \ldots, X_5$ are
iid, having pdf $f(x) = 6x(1 - x)$, $0 < x < 1$, zero elsewhere.


2.8.6. Determine the mean and variance of the sample mean $\overline{X} = 5^{-1} \sum_{i=1}^{5} X_i$,
where $X_1, \ldots, X_5$ is a random sample from a distribution having pdf $f(x) = 4x^3$, $0 <
x < 1$, zero elsewhere.


2.8.7. Let $X$ and $Y$ be random variables with $\mu_1 = 1$, $\mu_2 = 4$, $\sigma_1^2 = 4$, $\sigma_2^2 =
6$, $\rho = \frac{1}{2}$. Find the mean and variance of the random variable $Z = 3X - 2Y$.


2.8.8. Let $X$ and $Y$ be independent random variables with means $\mu_1, \mu_2$ and
variances $\sigma_1^2, \sigma_2^2$. Determine the correlation coefficient of $X$ and $Z = X - Y$ in
terms of $\mu_1, \mu_2, \sigma_1^2, \sigma_2^2$.

2.8.9. Let $\mu$ and $\sigma^2$ denote the mean and variance of the random variable $X$. Let
$Y = c + bX$, where $b$ and $c$ are real constants. Show that the mean and variance of
$Y$ are, respectively, $c + b\mu$ and $b^2 \sigma^2$.


2.8.10. Determine the correlation coefficient of the random variables $X$ and $Y$ if
var$(X) = 4$, var$(Y) = 2$, and var$(X + 2Y) = 15$.


2.8.11. Let $X$ and $Y$ be random variables with means $\mu_1, \mu_2$; variances $\sigma_1^2, \sigma_2^2$; and
correlation coefficient $\rho$. Show that the correlation coefficient of $W = aX + b$, $a > 0$,
and $Z = cY + d$, $c > 0$, is $\rho$.


2.8.12. A person rolls a die, tosses a coin, and draws a card from an ordinary
deck. He receives $3 for each point up on the die, $10 for a head and $0 for a
tail, and $1 for each spot on the card (jack = 11, queen = 12, king = 13). If we
assume that the three random variables involved are independent and uniformly
distributed, compute the mean and variance of the amount to be received.


2.8.13. Let $X_1$ and $X_2$ be independent random variables with nonzero variances.
Find the correlation coefficient of $Y = X_1 X_2$ and $X_1$ in terms of the means and
variances of $X_1$ and $X_2$.


2.8.14. Let $X_1$ and $X_2$ have a joint distribution with parameters $\mu_1, \mu_2, \sigma_1^2, \sigma_2^2$,
and $\rho$. Find the correlation coefficient of the linear functions $Y = a_1 X_1 + a_2 X_2$
and $Z = b_1 X_1 + b_2 X_2$ in terms of the real constants $a_1, a_2, b_1, b_2$, and the
parameters of the distribution.


2.8.15. Let $X_1, X_2$, and $X_3$ be random variables with equal variances but with
correlation coefficients $\rho_{12} = 0.3$, $\rho_{13} = 0.5$, and $\rho_{23} = 0.2$. Find the correlation
coefficient of the linear functions $Y = X_1 + X_2$ and $Z = X_2 + X_3$.


2.8.16. Find the variance of the sum of 10 random variables if each has variance 5
and if each pair has correlation coefficient 0.5.


2.8.17. Let $X$ and $Y$ have the parameters $\mu_1, \mu_2, \sigma_1^2, \sigma_2^2$, and $\rho$. Show that the
correlation coefficient of $X$ and $[Y - \rho(\sigma_2/\sigma_1)X]$ is zero.


2.8.18. Let $S^2$ be the sample variance of a random sample from a distribution with
variance $\sigma^2 > 0$. Since $E(S^2) = \sigma^2$, why isn't $E(S) = \sigma$?
Hint: Use Jensen's inequality to show that $E(S) < \sigma$.


Chapter 3 


Some Special Distributions 


3.1 The Binomial and Related Distributions 


In Chapter 1 we introduced the uniform distribution and the hypergeometric dis- 
tribution. In this chapter we discuss some other important distributions of random 
variables frequently used in statistics. We begin with the binomial and related 
distributions. 

A Bernoulli experiment is a random experiment, the outcome of which can 
be classified in but one of two mutually exclusive and exhaustive ways, for instance, 
success or failure (e.g., female or male, life or death, nondefective or defective). 
A sequence of Bernoulli trials occurs when a Bernoulli experiment is performed 
several independent times so that the probability of success, say p, remains the same 
from trial to trial. That is, in such a sequence, we let p denote the probability of 
success on each trial. 

Let $X$ be a random variable associated with a Bernoulli trial by defining it as
follows:

$$X(\text{success}) = 1 \quad \text{and} \quad X(\text{failure}) = 0.$$

That is, the two outcomes, success and failure, are denoted by one and zero, respectively.
The pmf of $X$ can be written as

$$p(x) = p^x (1 - p)^{1-x}, \quad x = 0, 1, \tag{3.1.1}$$

and we say that $X$ has a Bernoulli distribution. The expected value of $X$ is

$$\mu = E(X) = (0)(1 - p) + (1)(p) = p,$$

and the variance of $X$ is

$$\sigma^2 = \text{var}(X) = p^2(1 - p) + (1 - p)^2 p = p(1 - p).$$

It follows that the standard deviation of $X$ is $\sigma = \sqrt{p(1 - p)}$.
In a sequence of $n$ independent Bernoulli trials, where the probability of success
remains constant, let $X_i$ denote the Bernoulli random variable associated with the
$i$th trial. An observed sequence of $n$ Bernoulli trials is then an $n$-tuple of zeros and
ones. In such a sequence of Bernoulli trials, we are often interested in the total
number of successes and not in the order of their occurrence. If we let the random
variable $X$ equal the number of observed successes in $n$ Bernoulli trials, the possible
values of $X$ are $0, 1, 2, \ldots, n$. If $x$ successes occur, where $x = 0, 1, 2, \ldots, n$, then $n - x$
failures occur. The number of ways of selecting the $x$ positions for the $x$ successes
in the $n$ trials is

$$\binom{n}{x} = \frac{n!}{x!(n - x)!}.$$

Since the trials are independent and the probabilities of success and failure on
each trial are, respectively, $p$ and $1 - p$, the probability of each of these ways is
$p^x (1 - p)^{n-x}$. Thus the pmf of $X$, say $p(x)$, is the sum of the probabilities of these
$\binom{n}{x}$ mutually exclusive events; that is,

$$p(x) = \begin{cases} \dbinom{n}{x} p^x (1 - p)^{n-x} & x = 0, 1, 2, \ldots, n \\ 0 & \text{elsewhere.} \end{cases} \tag{3.1.2}$$

It is clear that $p(x) \ge 0$. To verify that $p(x)$ sums to 1 over its range, recall the
binomial series, expression (1.3.7) of Chapter 1, which is:

$$(a + b)^n = \sum_{x=0}^{n} \binom{n}{x} b^x a^{n-x},$$

for $n$ a positive integer. Thus,

$$\sum_{x} p(x) = \sum_{x=0}^{n} \binom{n}{x} p^x (1 - p)^{n-x} = [(1 - p) + p]^n = 1.$$

Therefore, $p(x)$ satisfies the conditions of being a pmf of a random variable $X$ of
the discrete type. A random variable $X$ that has a pmf of the form of $p(x)$ is said
to have a binomial distribution, and any such $p(x)$ is called a binomial pmf. A
binomial distribution is denoted by the symbol $b(n, p)$. The constants $n$ and $p$ are
called the parameters of the binomial distribution.


Example 3.1.1 (Computation of Binomial Probabilities). Suppose we roll a fair
six-sided die 3 times. What is the probability of getting exactly 2 sixes? For our
notation, let $X$ be the number of sixes obtained in the 3 rolls. Then $X$ has a
binomial distribution with $n = 3$ and $p = 1/6$. Hence,

$$P(X = 2) = p(2) = \binom{3}{2} \left( \frac{1}{6} \right)^2 \left( \frac{5}{6} \right)^1 = 0.06944.$$

We can do this calculation with a hand calculator. Suppose, though, we want to
determine the probability of at least 16 sixes in 60 rolls. Let $Y$ be the number of
sixes in 60 rolls. Then our desired probability is given by the series

$$P(Y \ge 16) = \sum_{j=16}^{60} \binom{60}{j} \left( \frac{1}{6} \right)^j \left( \frac{5}{6} \right)^{60-j},$$

which is not a simple calculation. Most statistical packages provide procedures to
calculate binomial probabilities. In R, if $Y$ is $b(n, p)$ then the cdf of $Y$ is computed
as $P(Y \le y)$ = pbinom(y,n,p). Hence, for our example, using R we compute
$P(Y \ge 16)$ as

$$P(Y \ge 16) = 1 - P(Y \le 15) = 1 - \text{pbinom(15, 60, 1/6)} = 0.0338.$$

The R function dbinom computes the pmf of a binomial distribution. For instance,
to compute the probability that $Y = 11$, we use the R code: dbinom(11,60,1/6),
which computes to 0.1246. ■
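The calculations in Example 3.1.1 can be collected into one short R session. The sketch below uses only dbinom and pbinom as described in the example; the variable names are ours. It also evaluates the series for P(Y >= 16) term by term to confirm that it matches the cdf-based computation.

```r
# Exactly 2 sixes in 3 rolls: pmf of b(3, 1/6) at 2
p_two_sixes <- dbinom(2, 3, 1/6)

# At least 16 sixes in 60 rolls, via the cdf and via the direct sum
p_at_least_16 <- 1 - pbinom(15, 60, 1/6)
p_direct_sum <- sum(dbinom(16:60, 60, 1/6))

# Exactly 11 sixes in 60 rolls
p_eleven <- dbinom(11, 60, 1/6)

round(c(p_two_sixes, p_at_least_16, p_direct_sum, p_eleven), 4)
```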


The mgf of a binomial distribution is easily obtained as follows:

$$M(t) = \sum_{x} e^{tx} p(x) = \sum_{x=0}^{n} e^{tx} \binom{n}{x} p^x (1 - p)^{n-x}
= \sum_{x=0}^{n} \binom{n}{x} (pe^t)^x (1 - p)^{n-x}
= [(1 - p) + pe^t]^n$$

for all real values of $t$. The mean $\mu$ and the variance $\sigma^2$ of $X$ may be computed
from $M(t)$. Since

$$M'(t) = n[(1 - p) + pe^t]^{n-1}(pe^t)$$

and

$$M''(t) = n[(1 - p) + pe^t]^{n-1}(pe^t) + n(n - 1)[(1 - p) + pe^t]^{n-2}(pe^t)^2,$$

it follows that

$$\mu = M'(0) = np$$

and

$$\sigma^2 = M''(0) - \mu^2 = np + n(n - 1)p^2 - (np)^2 = np(1 - p).$$

Suppose $Y$ has the $b(60, 1/6)$ distribution as discussed in Example 3.1.1. Then
$E(Y) = 60(1/6) = 10$ and $\text{Var}(Y) = 60(1/6)(5/6) = 8.33$.

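Example 3.1.2. If the mgf of a random variable $X$ is

$$M(t) = \left( \tfrac{2}{3} + \tfrac{1}{3} e^t \right)^5,$$

then $X$ has a binomial distribution with $n = 5$ and $p = \frac{1}{3}$; that is, the pmf of $X$ is

$$p(x) = \begin{cases} \dbinom{5}{x} \left( \tfrac{1}{3} \right)^x \left( \tfrac{2}{3} \right)^{5-x} & x = 0, 1, 2, \ldots, 5 \\ 0 & \text{elsewhere.} \end{cases}$$

Here $\mu = np = \frac{5}{3}$ and $\sigma^2 = np(1 - p) = \frac{10}{9}$. ■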
Example 3.1.3. If $Y$ is $b(n, \frac{1}{3})$, then $P(Y \ge 1) = 1 - P(Y = 0) = 1 - (\frac{2}{3})^n$.
Suppose that we wish to find the smallest value of $n$ that yields $P(Y \ge 1) > 0.80$.
We have $1 - (\frac{2}{3})^n > 0.80$ and $0.20 > (\frac{2}{3})^n$. Either by inspection or by use of
logarithms, we see that $n = 4$ is the solution. That is, the probability of at least
one success throughout $n = 4$ independent repetitions of a random experiment with
probability of success $p = \frac{1}{3}$ is greater than 0.80. ■
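The search "by inspection or by use of logarithms" in Example 3.1.3 is short enough to carry out in R. The sketch below is one way to do it, with variable names of our own choosing. It increases n until the inequality holds and then repeats the calculation with logarithms.

```r
p <- 1/3

# By inspection: increase n until P(Y >= 1) = 1 - (1 - p)^n exceeds 0.80
n <- 1
while (1 - (1 - p)^n <= 0.80) {
  n <- n + 1
}

# By logarithms: (2/3)^n < 0.20 means n > log(0.20) / log(2/3).
# Taking the ceiling works here because the ratio is not an integer.
n_log <- ceiling(log(0.20) / log(1 - p))

# The same probability from the binomial cdf
p_at_least_one <- 1 - pbinom(0, n, p)

c(n, n_log, p_at_least_one)
```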


Example 3.1.4. Let the random variable $Y$ be equal to the number of successes
throughout $n$ independent repetitions of a random experiment with probability $p$
of success. That is, $Y$ is $b(n, p)$. The ratio $Y/n$ is called the relative frequency of
success. Recall expression (1.10.3), the second version of Chebyshev's inequality
(Theorem 1.10.3). Applying this result, we have for all $\epsilon > 0$ that

$$P\left( \left| \frac{Y}{n} - p \right| \ge \epsilon \right) \le \frac{\text{Var}(Y/n)}{\epsilon^2} = \frac{p(1 - p)}{n\epsilon^2}$$

[Exercise 3.1.3 asks for the determination of Var$(Y/n)$]. Now, for every fixed $\epsilon > 0$,
the right-hand member of the preceding inequality is close to zero for sufficiently
large $n$. That is,

$$\lim_{n \to \infty} P\left( \left| \frac{Y}{n} - p \right| \ge \epsilon \right) = 0$$

and

$$\lim_{n \to \infty} P\left( \left| \frac{Y}{n} - p \right| < \epsilon \right) = 1.$$

Since this is true for every fixed $\epsilon > 0$, we see, in a certain sense, that the relative
frequency of success is, for large values of $n$, close to the probability $p$ of success.
This result is one form of the Weak Law of Large Numbers. It was alluded to
in the initial discussion of probability in Chapter 1 and is considered again, along
with related concepts, in Chapter 5.
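To see how the Chebyshev bound in Example 3.1.4 compares with the exact probability, one can compute both in R for a few sample sizes. In the sketch below, the values p = 0.3, epsilon = 0.05, and the sample sizes in ns are arbitrary illustrative choices. The exact probability is obtained by summing dbinom over the values of y with |y/n - p| >= epsilon.

```r
p <- 0.3       # hypothetical success probability
eps <- 0.05    # hypothetical tolerance epsilon
ns <- c(100, 1000, 10000)

# Exact P(|Y/n - p| >= eps) for Y ~ b(n, p), by summing the pmf
exact <- sapply(ns, function(n) {
  y <- 0:n
  sum(dbinom(y[abs(y / n - p) >= eps], n, p))
})

# Chebyshev upper bound p(1 - p) / (n * eps^2)
bound <- p * (1 - p) / (ns * eps^2)

cbind(n = ns, exact = exact, bound = bound)
```

In this setting the exact probabilities fall well below the bound and shrink toward zero as n grows, which is the behavior the Weak Law of Large Numbers describes.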


Example 3.1.5. Let the independent random variables $X_1, X_2, X_3$ have the same
cdf $F(x)$. Let $Y$ be the middle value of $X_1, X_2, X_3$. To determine the cdf of $Y$, say
$F_Y(y) = P(Y \le y)$, we note that $Y \le y$ if and only if at least two of the random
variables $X_1, X_2, X_3$ are less than or equal to $y$. Let us say that the $i$th "trial"
is a success if $X_i \le y$, $i = 1, 2, 3$; here each "trial" has the probability of success
$F(y)$. In this terminology, $F_Y(y) = P(Y \le y)$ is then the probability of at least two
successes in three independent trials. Thus

$$F_Y(y) = \binom{3}{2} [F(y)]^2 [1 - F(y)] + [F(y)]^3.$$

If $F(x)$ is a continuous cdf so that the pdf of $X$ is $F'(x) = f(x)$, then the pdf of $Y$
is

$$f_Y(y) = F_Y'(y) = 6[F(y)][1 - F(y)]f(y).$$
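The cdf of the middle value can be evaluated three ways in R when the common distribution is uniform on (0, 1), so that F(y) = y. This choice of distribution is ours, made for illustration. The sketch below uses the formula of Example 3.1.5, the binomial cdf pbinom for "at least two successes in three trials", and the Beta(2, 2) cdf pbeta. That last identification is a standard fact about the median of three uniforms rather than something derived in this section.

```r
# cdf of the middle value of three iid variables, as a function of F(y)
median_cdf <- function(Fy) choose(3, 2) * Fy^2 * (1 - Fy) + Fy^3

y <- c(0.10, 0.25, 0.50, 0.90)   # evaluation points (our choice)
Fy <- punif(y)                   # uniform(0, 1) cdf, so F(y) = y

via_formula <- median_cdf(Fy)
via_binomial <- 1 - pbinom(1, 3, Fy)   # P(at least 2 successes in 3 trials)
via_beta <- pbeta(y, 2, 2)             # assumed known: median ~ Beta(2, 2)

cbind(y, via_formula, via_binomial, via_beta)
```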


Suppose we have several independent binomial distributions with the same prob- 
ability of success. Then it makes sense that the sum of these random variables is 
binomial, as shown in the following theorem. 
Theorem 3.1.1. Let $X_1, X_2, \ldots, X_m$ be independent random variables such that
$X_i$ has binomial $b(n_i, p)$ distribution, for $i = 1, 2, \ldots, m$. Let $Y = \sum_{i=1}^{m} X_i$. Then
$Y$ has a binomial $b(\sum_{i=1}^{m} n_i, p)$ distribution.


Proof: The mgf of $X_i$ is $M_{X_i}(t) = (1 - p + pe^t)^{n_i}$. By independence it follows from
Theorem 2.6.1 that

$$M_Y(t) = \prod_{i=1}^{m} (1 - p + pe^t)^{n_i} = (1 - p + pe^t)^{\sum_{i=1}^{m} n_i}.$$

Hence, $Y$ has a binomial $b(\sum_{i=1}^{m} n_i, p)$ distribution. ■
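Theorem 3.1.1 can be illustrated without mgfs by convolving two binomial pmfs directly. In the R sketch below, the values p = 0.3, n1 = 3, and n2 = 4 are arbitrary. The pmf of X1 + X2 is computed by summing over the ways the total can be split, and the result is compared with the b(n1 + n2, p) pmf.

```r
p <- 0.3      # common success probability (illustrative)
n1 <- 3
n2 <- 4

y <- 0:(n1 + n2)

# P(X1 + X2 = y) by convolution: sum over feasible values x1 of X1
pmf_convolution <- sapply(y, function(yy) {
  x1 <- max(0, yy - n2):min(n1, yy)
  sum(dbinom(x1, n1, p) * dbinom(yy - x1, n2, p))
})

# pmf of b(n1 + n2, p), as Theorem 3.1.1 asserts
pmf_direct <- dbinom(y, n1 + n2, p)

max(abs(pmf_convolution - pmf_direct))
```

If the two variables had different success probabilities, the convolution would still be a valid pmf but would no longer match any single binomial pmf.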


For the remainder of this section, we discuss some important distributions that 
are related to the binomial distribution. 


3.1.1 Negative Binomial and Geometric Distributions 


Consider a sequence of independent Bernoulli trials with constant probability $p$ of
success. Let the random variable $Y$ denote the total number of failures in this
sequence before the $r$th success, that is, $Y + r$ is equal to the number of trials
necessary to produce exactly $r$ successes with the last trial as a success. Here $r$
is a fixed positive integer. To determine the pmf of $Y$, let $y$ be an element of
$\{y : y = 0, 1, 2, \ldots\}$. Then, since the trials are independent, $P(Y = y)$ is equal
to the product of the probability of obtaining exactly $r - 1$ successes in the first
$y + r - 1$ trials times the probability $p$ of a success on the $(y + r)$th trial. Thus the
pmf of $Y$ is

$$p_Y(y) = \begin{cases} \dbinom{y + r - 1}{r - 1} p^r (1 - p)^y & y = 0, 1, 2, \ldots \\ 0 & \text{elsewhere.} \end{cases} \tag{3.1.3}$$

A distribution with a pmf of the form $p_Y(y)$ is called a negative binomial
distribution and any such $p_Y(y)$ is called a negative binomial pmf. The distribution
derives its name from the fact that $p_Y(y)$ is a general term in the expansion of
$p^r [1 - (1 - p)]^{-r}$. It is left as an exercise to show that the mgf of this distribution
is $M(t) = p^r [1 - (1 - p)e^t]^{-r}$, for $t < -\log(1 - p)$. The R call to compute $P(Y \le y)$
is pnbinom(y,r,p).


Example 3.1.6. Suppose the probability that a person has blood type B is 0.12.
In order to conduct a study concerning people with blood type B, patients are
sampled independently of one another until 10 are obtained who have blood type
B. Determine the probability that at most 30 patients have to have their blood type
determined. Let $Y$ have a negative binomial distribution with $p = 0.12$ and $r = 10$.
Then, the desired probability is

$$P(Y \le 20) = \sum_{j=0}^{20} \binom{j + 9}{9} 0.12^{10} \, 0.88^{j}.$$

Its computation in R is pnbinom(20,10,0.12) = 0.0019. ■
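The probability in Example 3.1.6 can be reproduced in R in three equivalent ways, shown in the sketch below with variable names of our own choosing: by summing the negative binomial pmf term by term, by calling pnbinom, and by the binomial restatement that at most 30 patients are needed exactly when the first 30 patients include at least 10 with type B blood.

```r
p <- 0.12
r <- 10

# Sum of the negative binomial pmf over y = 0, ..., 20 failures
j <- 0:20
by_sum <- sum(choose(j + r - 1, r - 1) * p^r * (1 - p)^j)

# The R call given in the example
by_pnbinom <- pnbinom(20, r, p)

# Equivalent binomial event: at least r successes in the first 30 trials
by_binomial <- 1 - pbinom(r - 1, 30, p)

round(c(by_sum, by_pnbinom, by_binomial), 4)
```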
If $r = 1$, then $Y$ has the pmf

$$p_Y(y) = p(1 - p)^y, \quad y = 0, 1, 2, \ldots, \tag{3.1.4}$$

zero elsewhere, and the mgf $M(t) = p[1 - (1 - p)e^t]^{-1}$. In this special case, $r = 1$,
we say that $Y$ has a geometric distribution. In terms of Bernoulli trials, $Y$ is
the number of failures until the first success. The geometric distribution was first
discussed in Example 1.6.3 of Chapter 1. For the last example, the probability
that exactly 11 patients have to have their blood type determined before the first
patient with type B blood is found is given by $0.88^{11} \, 0.12$. This is computed in R by
dgeom(11,0.12) = 0.0294.


3.1.2 Multinomial Distribution 


The binomial distribution is generalized to the multinomial distribution as follows.
Let a random experiment be repeated $n$ independent times. On each repetition,
there is one and only one outcome from one of $k$ categories. Call the categories
$C_1, C_2, \ldots, C_k$. For example, consider the upface of a roll of a six-sided die; then the
categories are $C_i = \{i\}$, $i = 1, 2, \ldots, 6$. For $i = 1, \ldots, k$, let $p_i$ be the probability that
the outcome is an element of $C_i$ and assume that $p_i$ remains constant throughout
the $n$ independent repetitions. Define the random variable $X_i$ to be equal to the
number of outcomes that are elements of $C_i$, $i = 1, 2, \ldots, k - 1$. Because $X_k =
n - X_1 - \cdots - X_{k-1}$, $X_k$ is determined by the other $X_i$'s. Hence, for the joint
distribution of interest we need only consider $X_1, X_2, \ldots, X_{k-1}$.

The joint pmf of $(X_1, X_2, \ldots, X_{k-1})$ is

$$P(X_1 = x_1, X_2 = x_2, \ldots, X_{k-1} = x_{k-1}) = \frac{n!}{x_1! \cdots x_{k-1}! \, x_k!} \, p_1^{x_1} \cdots p_{k-1}^{x_{k-1}} p_k^{x_k}, \tag{3.1.5}$$

for all $x_1, x_2, \ldots, x_{k-1}$ that are nonnegative integers and such that $x_1 + x_2 + \cdots +
x_{k-1} \le n$, where $x_k = n - x_1 - \cdots - x_{k-1}$ and $p_k = 1 - \sum_{j=1}^{k-1} p_j$. We next show
that expression (3.1.5) is correct. The number of distinguishable arrangements of
$x_1$ $C_1$s, $x_2$ $C_2$s, $\ldots$, $x_k$ $C_k$s is

$$\binom{n}{x_1} \binom{n - x_1}{x_2} \cdots \binom{n - x_1 - \cdots - x_{k-2}}{x_{k-1}} = \frac{n!}{x_1! \, x_2! \cdots x_k!}$$

and the probability of each of these distinguishable arrangements is

$$p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}.$$

Hence the product of these two latter expressions gives the correct probability, which
is in agreement with expression (3.1.5).
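Expression (3.1.5) can be compared with R's built-in multinomial pmf, dmultinom, which takes the full vector of counts (including x_k) and the full vector of probabilities. The counts and probabilities in this sketch are made-up values for a case with k = 3 categories.

```r
x <- c(2, 1, 3)          # hypothetical counts x_1, x_2, x_3 (x_3 = n - x_1 - x_2)
p <- c(0.2, 0.3, 0.5)    # hypothetical category probabilities, summing to 1
n <- sum(x)

# Expression (3.1.5): number of arrangements times probability of each
by_formula <- factorial(n) / prod(factorial(x)) * prod(p^x)

# Base R multinomial pmf
by_dmultinom <- dmultinom(x, size = n, prob = p)

c(by_formula, by_dmultinom)
```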

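We say that $(X_1, X_2, \ldots, X_{k-1})$ has a multinomial distribution with parameters
$n$ and $p_1, \ldots, p_{k-1}$. The joint mgf of $(X_1, X_2, \ldots, X_{k-1})$ is $M(t_1, \ldots, t_{k-1}) =
E(\exp\{\sum_{i=1}^{k-1} t_i X_i\})$, i.e.,

$$M(t_1, \ldots, t_{k-1}) = \sum \cdots \sum \frac{n!}{x_1! \cdots x_{k-1}! \, x_k!} (p_1 e^{t_1})^{x_1} \cdots (p_{k-1} e^{t_{k-1}})^{x_{k-1}} p_k^{x_k},$$

where the multiple sum is taken over all nonnegative integers and such that $x_1 +
x_2 + \cdots + x_{k-1} \le n$. Let $m = \sum_{i=1}^{k-1} p_i e^{t_i} + p_k$. Recall that $x_k = n - \sum_{i=1}^{k-1} x_i$.
Then since $m > 0$, we have

$$M(t_1, \ldots, t_{k-1}) = m^n \sum \cdots \sum \frac{n!}{x_1! \cdots x_{k-1}! \, x_k!}
\left( \frac{p_1 e^{t_1}}{m} \right)^{x_1} \cdots \left( \frac{p_{k-1} e^{t_{k-1}}}{m} \right)^{x_{k-1}} \left( \frac{p_k}{m} \right)^{x_k}
= m^n \times 1 = \left( \sum_{i=1}^{k-1} p_i e^{t_i} + p_k \right)^n, \tag{3.1.6}$$

where we have used the property that the sum of a pmf over its support is 1.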
We can use the joint mgf to determine marginal distributions. The mgf of $X_i$ is

$$M(0, \ldots, 0, t_i, 0, \ldots, 0) = (p_i e^{t_i} + (1 - p_i))^n;$$

hence, $X_i$ is binomial with parameters $n$ and $p_i$. The mgf of $(X_i, X_j)$, $i < j$, is

$$M(0, \ldots, 0, t_i, 0, \ldots, 0, t_j, 0, \ldots, 0) = (p_i e^{t_i} + p_j e^{t_j} + (1 - p_i - p_j))^n,$$

so that $(X_i, X_j)$ has a multinomial distribution with parameters $n$, $p_i$, and $p_j$. At
times, we say that $(X_1, X_2)$ has a trinomial distribution.

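Another distribution of interest is the conditional distribution of $X_i$ given $X_j$.
For convenience, we select $i = 2$ and $j = 1$. We know that $(X_1, X_2)$ is multinomial
with parameters $n$ and $p_1$ and $p_2$ and that $X_1$ is binomial with parameters $n$ and
$p_1$. Thus, the conditional pmf is

$$p_{X_2|X_1}(x_2|x_1) = \frac{p_{X_1,X_2}(x_1, x_2)}{p_{X_1}(x_1)}
= \frac{n! \, p_1^{x_1} p_2^{x_2} [1 - (p_1 + p_2)]^{n - (x_1 + x_2)}}{x_1! \, x_2! \, [n - (x_1 + x_2)]!} \cdot \frac{x_1! \, (n - x_1)!}{n! \, p_1^{x_1} (1 - p_1)^{n - x_1}}$$

$$= \binom{n - x_1}{x_2} \frac{p_2^{x_2} [1 - (p_1 + p_2)]^{n - x_1 - x_2}}{(1 - p_1)^{x_2} (1 - p_1)^{n - x_1 - x_2}}
= \binom{n - x_1}{x_2} \left( \frac{p_2}{1 - p_1} \right)^{x_2} \left( 1 - \frac{p_2}{1 - p_1} \right)^{n - x_1 - x_2},$$

for $0 \le x_2 \le n - x_1$. Note that $p_2 < 1 - p_1$. Thus, the conditional distribution of
$X_2$ given $X_1 = x_1$ is binomial with parameters $n - x_1$ and $p_2/(1 - p_1)$.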

Based on the conditional distribution of $X_2$ given $X_1$, we have $E(X_2|X_1) =
(n - X_1)p_2/(1 - p_1)$. Let $\rho_{12}$ be the correlation coefficient between $X_1$ and $X_2$.
Since the conditional mean is linear with slope $-p_2/(1 - p_1)$, $\sigma_2 = \sqrt{np_2(1 - p_2)}$,
and $\sigma_1 = \sqrt{np_1(1 - p_1)}$, it follows from expression (2.5.4) that

$$\rho_{12} = -\frac{p_2}{1 - p_1} \frac{\sigma_1}{\sigma_2} = -\sqrt{\frac{p_1 p_2}{(1 - p_1)(1 - p_2)}}.$$

Because the support of $X_1$ and $X_2$ has the constraint $x_1 + x_2 \le n$, the negative
correlation is not surprising.
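For a small trinomial case, the correlation coefficient and the conditional distribution can be checked by enumerating the whole support in R. The parameter values n = 5, p1 = 0.2, and p2 = 0.3 below are illustrative choices. The sketch computes the correlation from the joint pmf and compares it with the closed form, then compares one conditional probability with the binomial pmf dbinom.

```r
n <- 5
p1 <- 0.2    # illustrative trinomial parameters
p2 <- 0.3

# Enumerate the support {(x1, x2): x1 + x2 <= n} with its joint probabilities
support <- expand.grid(x1 = 0:n, x2 = 0:n)
support <- support[support$x1 + support$x2 <= n, ]
support$prob <- mapply(function(x1, x2) {
  dmultinom(c(x1, x2, n - x1 - x2), prob = c(p1, p2, 1 - p1 - p2))
}, support$x1, support$x2)

# Moments by direct summation over the joint pmf
m1 <- sum(support$x1 * support$prob)
m2 <- sum(support$x2 * support$prob)
v1 <- sum(support$x1^2 * support$prob) - m1^2
v2 <- sum(support$x2^2 * support$prob) - m2^2
cov12 <- sum(support$x1 * support$x2 * support$prob) - m1 * m2

rho_enumerated <- cov12 / sqrt(v1 * v2)
rho_formula <- -sqrt(p1 * p2 / ((1 - p1) * (1 - p2)))

# Conditional pmf of X2 given X1 = 1, evaluated at x2 = 2
cond_from_joint <- dmultinom(c(1, 2, 2), prob = c(p1, p2, 1 - p1 - p2)) /
  dbinom(1, n, p1)
cond_binomial <- dbinom(2, n - 1, p2 / (1 - p1))

c(rho_enumerated, rho_formula)
c(cond_from_joint, cond_binomial)
```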
3.1.3 Hypergeometric Distribution


In Chapter 1, for a particular problem, we introduced the hypergeometric distribution;
see expression (1.6.4). We now formally define it. Suppose we have a lot of $N$
items of which $D$ are defective. Let $X$ denote the number of defective items in a
sample of size $n$. If the sampling is done with replacement and the items are chosen
at random, then $X$ has a binomial distribution with parameters $n$ and $D/N$.
In this case the mean and variance of $X$ are $n(D/N)$ and $n(D/N)[(N - D)/N]$,
respectively. Suppose, however, that the sampling is without replacement, which is
often the case in practice. The pmf of $X$ follows by noting in this case that each of
the $\binom{N}{n}$ samples is equilikely and that there are $\binom{N-D}{n-x}\binom{D}{x}$ samples that have $x$
defective items. Hence, the pmf of $X$ is

$$p(x) = \frac{\dbinom{N - D}{n - x} \dbinom{D}{x}}{\dbinom{N}{n}}, \quad x = 0, 1, \ldots, n, \tag{3.1.7}$$

where, as usual, a binomial coefficient is taken to be 0 when the top value is less
than the bottom value. We say that $X$ has a hypergeometric distribution with
parameters $(N, D, n)$.

The mean of $X$ is

$$E(X) = \sum_{x=0}^{n} x p(x) = \sum_{x=1}^{n} x \binom{N - D}{n - x} \frac{[D(D - 1)!]/[x(x - 1)!(D - x)!]}{[N(N - 1)!]/[(N - n)! \, n(n - 1)!]}$$

$$= n \frac{D}{N} \sum_{x=1}^{n} \frac{\dbinom{(N - 1) - (D - 1)}{(n - 1) - (x - 1)} \dbinom{D - 1}{x - 1}}{\dbinom{N - 1}{n - 1}} = n \frac{D}{N}.$$

In the next-to-last step, we used the fact that the probabilities of a hypergeometric
$(N - 1, D - 1, n - 1)$ distribution summed over its entire range is 1. So the means for
both types of sampling (with and without replacement) are the same. The variances,
though, differ. As Exercise 3.1.31 shows, the variance of a hypergeometric $(N, D, n)$
is

$$\text{Var}(X) = n \frac{D}{N} \frac{N - D}{N} \frac{N - n}{N - 1}. \tag{3.1.8}$$

The last term is often thought of as the correction term when sampling without
replacement. Note that it is close to 1 if $N$ is much larger than $n$.

The pmf (3.1.7) can be computed in R with the code dhyper(x, D, N-D, n).
Suppose we draw 2 cards from a well shuffled standard deck of 52 cards and record
the number of aces. The next R segment shows the probabilities over the range
{0, 1, 2} for sampling with and without replacement, respectively:

rng <- 0:2; dbinom(rng,2,1/13); dhyper(rng,4,48,2)
[1] 0.85207101 0.14201183 0.00591716 
[1] 0.850678733 0.144796380 0.004524887 


Notice how close the probabilities are. 
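The mean and variance formulas for the card example can be confirmed by summing over the hypergeometric pmf. The R sketch below uses dhyper with the argument order given above; the variable names are our own. It also computes the variance under sampling with replacement so the effect of the correction term (N - n)/(N - 1) is visible.

```r
N <- 52   # cards in the deck
D <- 4    # aces
n <- 2    # cards drawn

x <- 0:n
pmf <- dhyper(x, D, N - D, n)

# Moments by direct summation over the pmf (3.1.7)
mean_x <- sum(x * pmf)
var_x <- sum(x^2 * pmf) - mean_x^2

# Closed forms: mean n D / N and variance (3.1.8)
mean_formula <- n * D / N
var_formula <- n * (D / N) * ((N - D) / N) * ((N - n) / (N - 1))

# Variance under sampling with replacement (binomial), for comparison
var_binomial <- n * (D / N) * ((N - D) / N)

c(mean_x, mean_formula)
c(var_x, var_formula, var_binomial)
```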




EXERCISES 


3.1.1. If the mgf of a random variable $X$ is $(\frac{1}{3} + \frac{2}{3} e^t)^5$, find $P(X = 2 \text{ or } 3)$. Verify
using the R function dbinom.


3.1.2. The mgf of a random variable $X$ is $(\frac{2}{3} + \frac{1}{3} e^t)^9$.

(a) Show that

$$P(\mu - 2\sigma < X < \mu + 2\sigma) = \sum_{x=1}^{5} \binom{9}{x} \left( \frac{1}{3} \right)^x \left( \frac{2}{3} \right)^{9-x}.$$

(b) Use R to compute the probability in Part (a).

3.1.3. If $X$ is $b(n, p)$, show that

$$E\left( \frac{X}{n} \right) = p \quad \text{and} \quad E\left[ \left( \frac{X}{n} - p \right)^2 \right] = \frac{p(1 - p)}{n}.$$


3.1.4. Let the independent random variables $X_1, X_2, \ldots, X_{40}$ be iid with the common
pdf $f(x) = 3x^2$, $0 < x < 1$, zero elsewhere. Find the probability that at least
35 of the $X_i$'s exceed $\frac{1}{2}$.


3.1.5. Over the years, the percentage of candidates passing an entrance exam to a
prestigious law school is 20%. At one of the testing centers, a group of 50 candidates
take the exam and 20 pass. Is this odd? Answer on the basis that $X \ge 20$ where
$X$ is the number that pass in a group of 50 when the probability of a pass is 0.2.


3.1.6. Let $Y$ be the number of successes throughout $n$ independent repetitions of
a random experiment with probability of success $p = \frac{1}{4}$. Determine the smallest
value of $n$ so that $P(1 \le Y) \ge 0.70$.


3.1.7. Let the independent random variables $X_1$ and $X_2$ have binomial distribution
with parameters $n_1 = 3$, $p = \frac{2}{3}$ and $n_2 = 4$, $p = \frac{1}{2}$, respectively. Compute
$P(X_1 = X_2)$.

Hint: List the four mutually exclusive ways that $X_1 = X_2$ and compute the probability
of each.


3.1.8. For this exercise, the reader must have access to a statistical package that
obtains the binomial distribution. Hints are given for R code, but other packages
can be used too.


(a) Obtain the plot of the pmf for the $b(15, 0.2)$ distribution. Using R, the following
commands return the plot:


x<-0:15; plot(dbinom(x,15,.2)~x)


(b) Repeat part (a) for the binomial distributions with $n = 15$ and with $p =
0.10, 0.20, \ldots, 0.90$. Comment on the shapes of the pmf's as $p$ increases. Use
the following R segment:


x<-0:15; par(mfrow=c(3,3)); p <- 1:9/10
for(j in p) {plot(dbinom(x,15,j)~x); title(paste("p=",j))}


(c) Let $Y = X/n$, where $X$ has a $b(n, 0.05)$ distribution. Obtain the plots of the
pmfs of $Y$ for $n = 10, 20, 50, 200$. Comment on the plots (what do the plots
seem to be converging to as $n$ gets large?).


3.1.9. If $x = r$ is the unique mode of a distribution that is $b(n, p)$, show that

$$(n + 1)p - 1 < r < (n + 1)p.$$

This substantiates the comments made in Part (b) of Exercise 3.1.8.
Hint: Determine the values of $x$ for which $p(x + 1)/p(x) > 1$.


3.1.10. Suppose $X$ is $b(n, p)$. Then by definition the pmf is symmetric if and only
if $p(x) = p(n - x)$, for $x = 0, \ldots, n$. Show that the pmf is symmetric if and only if
$p = 1/2$.


3.1.11. Toss two nickels and three dimes at random. Make appropriate assumptions 
and compute the probability that there are more heads showing on the nickels than 
on the dimes. 


3.1.12. Let $X_1, X_2, \ldots, X_{k-1}$ have a multinomial distribution.
(a) Find the mgf of $X_2, X_3, \ldots, X_{k-1}$.
(b) What is the pmf of $X_2, X_3, \ldots, X_{k-1}$?
(c) Determine the conditional pmf of $X_1$ given that $X_2 = x_2, \ldots, X_{k-1} = x_{k-1}$.
(d) What is the conditional expectation $E(X_1 | x_2, \ldots, x_{k-1})$?

3.1.13. Let $X$ be $b(2,p)$ and let $Y$ be $b(4,p)$. If $P(X \ge 1) = 5/9$, find $P(Y \ge 1)$.


3.1.14. Let $X$ have a binomial distribution with parameters $n$ and $p = 1/3$.
Determine the smallest integer $n$ can be such that $P(X \ge 1) \ge 0.85$.


3.1.15. Let $X$ have the pmf $p(x) = (\frac{1}{3})(\frac{2}{3})^x$, $x = 0, 1, 2, 3, \ldots$, zero elsewhere. Find
the conditional pmf of $X$ given that $X \ge 3$.


3.1.16. One of the numbers $1, 2, \ldots, 6$ is to be chosen by casting an unbiased die.
Let this random experiment be repeated five independent times. Let the random
variable $X_1$ be the number of terminations in the set $\{x : x = 1, 2, 3\}$ and let
the random variable $X_2$ be the number of terminations in the set $\{x : x = 4, 5\}$.
Compute $P(X_1 = 2, X_2 = 1)$.


3.1.17. Show that the moment generating function of the negative binomial
distribution is $M(t) = p^r[1 - (1 - p)e^t]^{-r}$. Find the mean and the variance of this
distribution.

Hint: In the summation representing M(t), make use of the negative binomial 
series.1 


1See, for example, Mathematical Comments referenced in the Preface. 

3.1.18. One way of estimating the number of fish in a lake is the following
capture-recapture sampling scheme. Suppose there are $N$ fish in the lake where $N$ is
unknown. A specified number of fish $T$ are captured, tagged, and released back to
the lake. Then at a specified time and for a specified positive integer $r$, fish are
captured until the $r$th tagged fish is caught. The random variable of interest is $Y$,
the number of nontagged fish caught.


(a) What is the distribution of Y? Identify all parameters. 
(b) What is E(Y) and the Var(Y)? 


(c) The method-of-moments estimate of $N$ is obtained by setting $Y$ equal to the
expression for $E(Y)$ and solving this equation for $N$. Call the solution $\hat{N}$.
Determine $\hat{N}$.


(d) Determine the mean and variance of $\hat{N}$.


3.1.19. Consider a multinomial trial with outcomes $1, 2, \ldots, k$ and respective
probabilities $p_1, p_2, \ldots, p_k$. Let ps denote the R vector for $(p_1, p_2, \ldots, p_k)$. Then a single
random trial of this multinomial is computed with the command multitrial(ps),
where the required R functions are:²


psum <- function(v){
    # cumulative sums of the probability vector v
    p <- 0; psum <- c()
    for(j in 1:length(v)){p <- p + v[j]; psum <- c(psum, p)}
    return(psum)}

multitrial <- function(p){
    # one draw: locate the interval (pr[j], pr[j+1]] that contains the uniform r
    pr <- c(0, psum(p))
    r <- runif(1); ic <- 0; j <- 1
    while(ic == 0){if((r > pr[j]) && (r <= pr[j+1]))
         {multitrial <- j; ic <- 1}; j <- j + 1}
    return(multitrial)}


(a) Compute 10 random trials if ps=c(.3,.2,.2,.2,.1). 


(b) Compute 10,000 random trials for ps as in (a). Check to see how close the
estimates of the $p_i$ are to the true $p_i$.


3.1.20. Using the experiment in part (a) of Exercise 3.1.19, consider a game when 
a person pays $5 to play. If the trial results in a 1 or 2, she receives nothing; if a 
3, she receives $1; if a 4, she receives $2; and if a 5, she receives $20. Let G be her 
gain. 


(a) Determine E(G). 


(b) Write R code that simulates the gain. Then simulate it 10,000 times, collecting 
the gains. Compute the average of these 10,000 gains and compare it with 
E(G). 


2Downloadable at the site listed in the Preface 

3.1.21. Let $X_1$ and $X_2$ have a trinomial distribution. Differentiate the
moment-generating function to show that their covariance is $-np_1p_2$.


3.1.22. If a fair coin is tossed at random five independent times, find the conditional
probability of five heads given that there are at least four heads.


3.1.23. Let an unbiased die be cast at random seven independent times. Compute 
the conditional probability that each side appears at least once given that side 1 
appears exactly twice. 


3.1.24. Compute the measures of skewness and kurtosis of the binomial distribution 
b(n, p). 
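
3.1.25. Let
$$p(x_1, x_2) = \binom{x_1}{x_2}\left(\frac{1}{2}\right)^{x_1}\left(\frac{x_1}{15}\right), \quad x_2 = 0, 1, \ldots, x_1; \quad x_1 = 1, 2, 3, 4, 5,$$
zero elsewhere, be the joint pmf of $X_1$ and $X_2$. Determine

(a) $E(X_2)$.
(b) $u(x_1) = E(X_2 | x_1)$.
(c) $E[u(X_1)]$.
Compare the answers of parts (a) and (c).
Hint: Note that $E(X_2) = \sum_{x_1=1}^{5} \sum_{x_2=0}^{x_1} x_2\, p(x_1, x_2)$.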


3.1.26. Three fair dice are cast. In 10 independent casts, let X be the number of 
times all three faces are alike and let Y be the number of times only two faces are 
alike. Find the joint pmf of X and Y and compute E(6XY). 


3.1.27. Let $X$ have a geometric distribution. Show that
$$P(X \ge k + j \mid X \ge k) = P(X \ge j), \tag{3.1.9}$$


where k and j are nonnegative integers. Note that we sometimes say in this situation 
that X is memoryless. 


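3.1.28. Let $X$ equal the number of independent tosses of a fair coin that are required
to observe heads on consecutive tosses. Let $u_n$ equal the $n$th Fibonacci number,
where $u_1 = u_2 = 1$ and $u_n = u_{n-1} + u_{n-2}$, $n = 3, 4, 5, \ldots$.


(a) Show that the pmf of $X$ is
$$p(x) = \frac{u_{x-1}}{2^x}, \quad x = 2, 3, 4, \ldots.$$

(b) Use the fact that
$$u_n = \frac{1}{\sqrt{5}}\left[\left(\frac{1+\sqrt{5}}{2}\right)^n - \left(\frac{1-\sqrt{5}}{2}\right)^n\right]$$
to show that $\sum_{x=2}^{\infty} p(x) = 1$.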

3.1.29. Let the independent random variables $X_1$ and $X_2$ have binomial
distributions with parameters $n_1$, $p_1 = \frac{1}{2}$ and $n_2$, $p_2 = \frac{1}{2}$, respectively. Show that
$Y = X_1 - X_2 + n_2$ has a binomial distribution with parameters $n = n_1 + n_2$, $p = \frac{1}{2}$.


3.1.30. Consider a shipment of 1000 items into a factory. Suppose the factory 
can tolerate about 5% defective items. Let X be the number of defective items 
in a sample without replacement of size $n = 10$. Suppose the factory returns the
shipment if $X \ge 2$.


(a) Obtain the probability that the factory returns a shipment of items that has 
5% defective items. 


(b) Suppose the shipment has 10% defective items. Obtain the probability that 
the factory returns such a shipment. 


(c) Obtain approximations to the probabilities in parts (a) and (b) using appro- 
priate binomial distributions. 


Note: If you do not have access to a computer package with a hypergeometric com- 
mand, obtain the answer to (c) only. This is what would have been done in practice 
20 years ago. If you have access to R, then the command dhyper(x,D,N-D,n) 
returns the probability in expression (3.1.7). 


3.1.31. Show that the variance of a hypergeometric (N, D,n) distribution is given 
by expression (3.1.8). 

Hint: First obtain E[X (X — 1)] by proceeding in the same way as the derivation of 
the mean given in Section 3.1.3. 


3.2 The Poisson Distribution 


Recall that the following series expansion³ holds for all real numbers $z$:
$$e^z = 1 + z + \frac{z^2}{2!} + \frac{z^3}{3!} + \cdots = \sum_{x=0}^{\infty} \frac{z^x}{x!}.$$


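Consider the function $p(x)$ defined by
$$p(x) = \begin{cases} \dfrac{\lambda^x e^{-\lambda}}{x!} & x = 0, 1, 2, \ldots \\ 0 & \text{elsewhere,} \end{cases} \tag{3.2.1}$$
where $\lambda > 0$. Since $\lambda > 0$, then $p(x) \ge 0$ and
$$\sum_{x=0}^{\infty} p(x) = \sum_{x=0}^{\infty} \frac{\lambda^x e^{-\lambda}}{x!} = e^{-\lambda} \sum_{x=0}^{\infty} \frac{\lambda^x}{x!} = e^{-\lambda} e^{\lambda} = 1;$$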


3See, for example, the discussion on Taylor series in Mathematical Comments referenced in the 
Preface. 
that is, $p(x)$ satisfies the conditions of being a pmf of a discrete type of random
variable. A random variable that has a pmf of the form $p(x)$ is said to have a
Poisson distribution with parameter $\lambda$, and any such $p(x)$ is called a Poisson
pmf with parameter $\lambda$.

As the following remark shows, Poisson distributions occur in many areas of 
applications. 


Remark 3.2.1. Consider a process that counts the number of certain events occurring
over an interval of time; for example, the number of tornados that touch down
in Michigan per year, the number of cars entering a parking lot between 8:00 and
12:00 on a weekday, the number of car accidents at a busy intersection per week,
the number of typographical errors per page of a manuscript, and the number of
blemishes on a manufactured car door. As in the fourth and fifth examples, the
occurrences need not be over time. It is convenient, though, to use the time
representation in the following derivation. Let $X_t$ denote the number of occurrences
of such a process over the interval $(0, t]$. The range of $X_t$ is the set of nonnegative
integers $\{0, 1, 2, \ldots\}$. For a nonnegative integer $k$ and a real number $t > 0$, denote
the pmf of $X_t$ by $P(X_t = k) = g(k, t)$. Under the following three axioms, we next
show that $X_t$ has a Poisson distribution.


1. $g(1, h) = \lambda h + o(h)$, for a constant $\lambda > 0$.


2. $\sum_{k=2}^{\infty} g(k, h) = o(h)$.


3. The number of occurrences in nonoverlapping intervals are independent of one 
another. 


Here the $o(h)$ notation means that $o(h)/h \to 0$ as $h \to 0$. For instance, $h^2 = o(h)$
and $o(h) + o(h) = o(h)$. Note that the first two axioms imply that in a small
interval of time $h$, either one or no events occur and that the probability of one
event occurring is proportional to $h$.

By the method of induction, we now show that the distribution of $X_t$ is Poisson
with parameter $\lambda t$. First, we obtain $g(k, t)$ for $k = 0$. Note that the boundary
condition $g(0, 0) = 1$ is reasonable. No events occur in time $(0, t + h]$ if and only if
no events occur in $(0, t]$ and no events occur in $(t, t + h]$. By Axioms (1) and (2),
the probability that no events occur in the interval $(0, h]$ is $1 - \lambda h + o(h)$. Further,
the intervals $(0, t]$ and $(t, t + h]$ do not overlap. Hence, by Axiom (3) we have


$$g(0, t + h) = g(0, t)[1 - \lambda h + o(h)]. \tag{3.2.2}$$
That is,
$$\frac{g(0, t + h) - g(0, t)}{h} = -\lambda g(0, t) + g(0, t)\frac{o(h)}{h} \to -\lambda g(0, t), \quad \text{as } h \to 0.$$
Thus, $g(0, t)$ satisfies the differential equation
$$\frac{d_t\, g(0, t)}{g(0, t)} = -\lambda.$$
Integrating both sides with respect to $t$, we have for some constant $c$ that
$$\log g(0, t) = -\lambda t + c \quad \text{or} \quad g(0, t) = e^{-\lambda t} e^{c}.$$
Finally, using the boundary condition $g(0, 0) = 1$, we have $e^c = 1$. Hence,
$$g(0, t) = e^{-\lambda t}. \tag{3.2.3}$$


So the result holds for k = 0. 

For the remainder of the proof, assume that, for $k$ a nonnegative integer,
$g(k, t) = e^{-\lambda t}(\lambda t)^k/k!$. By induction, the proof follows if we can show that the
result holds for $g(k + 1, t)$. Another reasonable boundary condition is $g(k + 1, 0) = 0$.
Consider $g(k + 1, t + h)$. In order to have $k + 1$ occurrences in $(0, t + h]$, either there
are $k + 1$ occurrences in $(0, t]$ and no occurrences in $(t, t + h]$, or there are $k$
occurrences in $(0, t]$ and one occurrence in $(t, t + h]$. Because these events are disjoint,
we have by the independence of Axiom 3 that
$$g(k + 1, t + h) = g(k + 1, t)[1 - \lambda h + o(h)] + g(k, t)[\lambda h + o(h)],$$
that is,
$$\frac{g(k + 1, t + h) - g(k + 1, t)}{h} = -\lambda g(k + 1, t) + \lambda g(k, t) + [g(k + 1, t) + g(k, t)]\frac{o(h)}{h}.$$
Letting $h \to 0$ and using the value of $g(k, t)$, we obtain the differential equation
$$\frac{d}{dt}\, g(k + 1, t) = -\lambda g(k + 1, t) + \lambda e^{-\lambda t}[(\lambda t)^k/k!].$$
This is a linear differential equation of first order. Appealing to a theorem in
differential equations, its solution is
$$e^{\lambda t} g(k + 1, t) = \int \lambda^{k+1} \frac{t^k}{k!}\, dt + c.$$
Using the boundary condition $g(k + 1, 0) = 0$ and carrying out the integration, we
obtain
$$g(k + 1, t) = e^{-\lambda t}[(\lambda t)^{k+1}/(k + 1)!].$$
Therefore, $X_t$ has a Poisson distribution with parameter $\lambda t$. ∎
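
The limiting argument in the remark can be checked numerically. The following R sketch is an illustration under our own assumptions, not part of the derivation: it splits $(0, t]$ into $n$ subintervals of length $h = t/n$, treats each as an independent trial with success probability $\lambda h$ (ignoring the $o(h)$ terms), and compares the resulting binomial pmf with the Poisson pmf with parameter $\lambda t$. The values of lambda and t_len and the grid of n are arbitrary choices.

```r
# Illustrative values (assumptions, not from the text)
lambda <- 2; t_len <- 3; k <- 0:20

# Largest pointwise gap between the b(n, lambda*t/n) pmf and the Poisson(lambda*t) pmf
max_abs_diff <- function(n) {
  max(abs(dbinom(k, size = n, prob = lambda * t_len / n) - dpois(k, lambda * t_len)))
}

ns <- c(10, 100, 1000, 10000)
diffs <- sapply(ns, max_abs_diff)
print(cbind(n = ns, max_abs_diff = diffs))
```

The maximum discrepancy shrinks as n grows, which is consistent with the conclusion of the remark.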


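Let $X$ have a Poisson distribution with parameter $\lambda$. The mgf of $X$ is given by
$$M(t) = \sum_{x=0}^{\infty} e^{tx} p(x) = \sum_{x=0}^{\infty} e^{tx} \frac{\lambda^x e^{-\lambda}}{x!} = e^{-\lambda} \sum_{x=0}^{\infty} \frac{(\lambda e^t)^x}{x!} = e^{-\lambda} e^{\lambda e^t} = e^{\lambda(e^t - 1)}$$
for all real values of $t$. Since
$$M'(t) = e^{\lambda(e^t - 1)}(\lambda e^t)$$
and
$$M''(t) = e^{\lambda(e^t - 1)}(\lambda e^t)^2 + e^{\lambda(e^t - 1)}(\lambda e^t),$$
then
$$\mu = M'(0) = \lambda$$
and
$$\sigma^2 = M''(0) - \mu^2 = \lambda^2 + \lambda - \lambda^2 = \lambda.$$
That is, a Poisson distribution has $\mu = \sigma^2 = \lambda > 0$.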
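
As a quick numerical check of $\mu = \sigma^2 = \lambda$, the sketch below sums the pmf directly. The value of lambda and the truncation point 200 are our own illustrative choices; the tail beyond the truncation point is negligible for this lambda.

```r
lambda <- 2.5                      # illustrative parameter value
x <- 0:200                         # truncated support; remaining tail mass is negligible
p <- dpois(x, lambda)

total  <- sum(p)                   # should be close to 1
mu     <- sum(x * p)               # should be close to lambda
sigma2 <- sum((x - mu)^2 * p)      # should be close to lambda
print(c(total = total, mean = mu, variance = sigma2))
```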

If $X$ has a Poisson distribution with parameter $\lambda$, then $P(X = k)$ is computed
by the R command dpois(k, lambda) and the cumulative probability $P(X \le k)$ is
calculated by ppois(k, lambda).


Example 3.2.1. Let $X$ be the number of automobile accidents at a busy intersection
per week. Suppose that $X$ has a Poisson distribution with $\lambda = 2$. Then
the expected number of accidents per week is 2 and the standard deviation of the
number of accidents is $\sqrt{2}$. The probability of at least one accident in a week is

$P(X \ge 1) = 1 - P(X = 0) = 1 - e^{-2}$ = 1 - dpois(0,2) = 0.8647

and the probability that there are between 3 and 8 (inclusive) accidents is

$P(3 \le X \le 8) = P(X \le 8) - P(X \le 2)$ = ppois(8,2) - ppois(2,2) = 0.3231.

Suppose we want to determine the probability that there are exactly 16 accidents
in a 4 week period. By Remark 3.2.1, the number of accidents over a 4 week period
has a Poisson distribution with parameter $2 \times 4 = 8$. So the desired probability is
dpois(16,8) = 0.0045. The following R code computes a spiked plot of the pmf
of $X$ over $\{0, 1, \ldots, 7\}$, a subset of the range of $X$.


rng=0:7; y=dpois(rng,2); plot(y~rng,type="h",ylab="pmf",xlab="Rng")
points(y~rng,pch=16,cex=2)


Example 3.2.2. Let the probability of exactly one blemish in 1 foot of wire be
about $\frac{1}{1000}$ and let the probability of two or more blemishes in that length be, for
all practical purposes, zero. Let the random variable $X$ be the number of blemishes
in 3000 feet of wire. If we assume the independence of the number of blemishes in
nonoverlapping intervals, then by Remark 3.2.1 the postulates of the Poisson process
are approximated, with $\lambda = \frac{1}{1000}$ and $t = 3000$. Thus $X$ has an approximate Poisson
distribution with mean $3000(\frac{1}{1000}) = 3$. For example, the probability that there are
five or more blemishes in 3000 feet of wire is

$P(X \ge 5) = \sum_{k=5}^{\infty} \frac{3^k e^{-3}}{k!}$ = 1 - ppois(4,3) = 0.1847. ∎
The Poisson distribution satisfies the following important additive property.


Theorem 3.2.1. Suppose $X_1, \ldots, X_n$ are independent random variables and suppose
$X_i$ has a Poisson distribution with parameter $\lambda_i$. Then $Y = \sum_{i=1}^{n} X_i$ has a
Poisson distribution with parameter $\sum_{i=1}^{n} \lambda_i$.


Proof: We obtain the result by determining the mgf of $Y$, which by Theorem 2.6.1
is given by
$$M_Y(t) = E\left(e^{tY}\right) = \prod_{i=1}^{n} e^{\lambda_i(e^t - 1)} = e^{\sum_{i=1}^{n} \lambda_i (e^t - 1)}.$$
By the uniqueness of mgfs, we conclude that $Y$ has a Poisson distribution with
parameter $\sum_{i=1}^{n} \lambda_i$. ∎
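
The additive property can also be seen without mgfs by convolving two Poisson pmfs numerically. The sketch below is only a check for $n = 2$; the parameter values lam1 and lam2 and the range of y are arbitrary choices of ours.

```r
lam1 <- 2; lam2 <- 3               # illustrative parameters
y <- 0:30

# By independence, P(Y = s) = sum over x = 0..s of P(X1 = x) P(X2 = s - x)
conv <- sapply(y, function(s) sum(dpois(0:s, lam1) * dpois(s - 0:s, lam2)))

# Largest gap between the convolution and the Poisson(lam1 + lam2) pmf
gap <- max(abs(conv - dpois(y, lam1 + lam2)))
print(gap)
```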


Example 3.2.3 (Example 3.2.2, Continued). Suppose, as in Example 3.2.2, that
a bale of wire consists of 3000 feet. Based on the information in the example, we
expect three blemishes in a bale of wire, and the probability of five or more blemishes
is 0.1847. Suppose in a sampling plan, three bales of wire are selected at random and
we compute the mean number of blemishes in the wire. Now suppose we want to
determine the probability that the mean of the three observations is five or more
blemishes. Let $X_i$ be the number of blemishes in the $i$th bale of wire for $i = 1, 2, 3$.
Then $X_i$ has a Poisson distribution with parameter 3. The mean of $X_1$, $X_2$, and $X_3$
is $\bar{X} = 3^{-1} \sum_{i=1}^{3} X_i$, which can also be expressed as $Y/3$, where $Y = \sum_{i=1}^{3} X_i$. By
the last theorem, because the bales are independent of one another, $Y$ has a Poisson
distribution with parameter $\sum_{i=1}^{3} 3 = 9$. Hence, the desired probability is

$P(\bar{X} \ge 5) = P(Y \ge 15)$ = 1 - ppois(14,9) = 0.0415.

Hence, while it is not too odd that a bale has five or more blemishes (probability
is 0.1847), it is unusual (probability is 0.0415) that three independent bales of wire
average five or more blemishes. ∎


EXERCISES 


3.2.1. If the random variable X has a Poisson distribution such that P(X = 1) = 
P(X =2), find P(X =4). 


3.2.2. The mgf of a random variable $X$ is $e^{4(e^t - 1)}$. Show that
$P(\mu - 2\sigma < X < \mu + 2\sigma) = 0.931$.


3.2.3. In a lengthy manuscript, it is discovered that only 13.5 percent of the pages 
contain no typing errors. If we assume that the number of errors per page is a 
random variable with a Poisson distribution, find the percentage of pages that have 
exactly one error. 


3.2.4. Let the pmf $p(x)$ be positive on and only on the nonnegative integers. Given
that $p(x) = (4/x)p(x - 1)$, $x = 1, 2, 3, \ldots$, find the formula for $p(x)$.

Hint: Note that $p(1) = 4p(0)$, $p(2) = (4^2/2!)p(0)$, and so on. That is, find each
$p(x)$ in terms of $p(0)$ and then determine $p(0)$ from
$$1 = p(0) + p(1) + p(2) + \cdots.$$

3.2.5. Let $X$ have a Poisson distribution with $\mu = 100$. Use Chebyshev's inequality
to determine a lower bound for $P(75 < X < 125)$. Next, calculate the probability
using R. Is the approximation by Chebyshev's inequality accurate?


3.2.6. The following R code segment computes a page of plots for Poisson pmfs
with means $2, 4, 6, \ldots, 18$. Run this code and comment on the shapes and modes
of the distributions.

par (mfrow=c(3,3)); x= 0:35; lam=seq(2,18,2); 

for(y in lam) {plot (dpois(x,y)~x); title(paste("Mean is ",y))} 


3.2.7. By Exercise 3.2.6 it seems that the Poisson pmf peaks at its mean $\lambda$. Show
that this is the case by solving the inequalities $[p(x + 1)/p(x)] > 1$ and
$[p(x + 1)/p(x)] < 1$, where $p(x)$ is the pmf of a Poisson distribution with parameter $\lambda$.


3.2.8. Using the computer, obtain an overlay plot of the pmfs of the following two 
distributions: 


(a) Poisson distribution with A = 2. 
(b) Binomial distribution with n = 100 and p = 0.02. 
Why would these distributions be approximately the same? Discuss. 


3.2.9. Continuing with Exercise 3.2.8, make a page of four overlay plots for the
following 4 Poisson and binomial combinations: $\lambda = 2, p = 0.02$; $\lambda = 10, p = 0.10$;
$\lambda = 30, p = 0.30$; $\lambda = 50, p = 0.50$. Use $n = 100$ in each situation. Plot the subset of
the binomial range that is between $np \pm \sqrt{np(1 - p)}$. For each situation, comment
on the goodness of the Poisson approximation to the binomial.


3.2.10. The approximation discussed in Exercise 3.2.8 can be made precise in the
following way. Suppose $X_n$ is binomial with the parameters $n$ and $p = \lambda/n$, for a
given $\lambda > 0$. Let $Y$ be Poisson with mean $\lambda$. Show that $P(X_n = k) \to P(Y = k)$,
as $n \to \infty$, for an arbitrary but fixed value of $k$.

Hint: First show that:
$$P(X_n = k) = \frac{\lambda^k}{k!}\left[\frac{n(n - 1) \cdots (n - k + 1)}{n^k}\left(1 - \frac{\lambda}{n}\right)^{-k}\right]\left(1 - \frac{\lambda}{n}\right)^{n}.$$


3.2.11. Let the number of chocolate chips in a certain type of cookie have a Poisson 
distribution. We want the probability that a cookie of this type contains at least 
two chocolate chips to be greater than 0.99. Find the smallest value of the mean 
that the distribution can take. 


3.2.12. Compute the measures of skewness and kurtosis of the Poisson distribution
with mean $\mu$.


3.2.13. On the average, a grocer sells three of a certain article per week. How 
many of these should he have in stock so that the chance of his running out within 
a week is less than 0.01? Assume a Poisson distribution. 

3.2.14. Let X have a Poisson distribution. If P(X = 1) = P(X = 3), find the
mode of the distribution. 


3.2.15. Let X have a Poisson distribution with mean 1. Compute, if it exists, the 
expected value E(X!). 


3.2.16. Let $X$ and $Y$ have the joint pmf $p(x, y) = e^{-2}/[x!(y - x)!]$, $y = 0, 1, 2, \ldots$,
$x = 0, 1, \ldots, y$, zero elsewhere.


(a) Find the mgf $M(t_1, t_2)$ of this joint distribution.


(b) Compute the means, the variances, and the correlation coefficient of $X$ and
$Y$.


(c) Determine the conditional mean $E(X|y)$.
Hint: Note that
$$\sum_{x=0}^{y} [\exp(t_1 x)]\, y!/[x!(y - x)!] = [1 + \exp(t_1)]^y.$$
Why?


3.2.17. Let $X_1$ and $X_2$ be two independent random variables. Suppose that $X_1$ and
$Y = X_1 + X_2$ have Poisson distributions with means $\mu_1$ and $\mu > \mu_1$, respectively.
Find the distribution of $X_2$.


3.3 The $\Gamma$, $\chi^2$, and $\beta$ Distributions


In this section we introduce the continuous gamma $\Gamma$-distribution and several
associated distributions. The support for the $\Gamma$-distribution is the set of positive real
numbers. This distribution and its associated distributions are rich in applications
in all areas of science and business. These applications include their use in modeling
lifetimes, failure times, service times, and waiting times.

The definition of the $\Gamma$-distribution requires the $\Gamma$ function from calculus. It is
proved in calculus that the integral
$$\int_0^{\infty} y^{\alpha - 1} e^{-y}\, dy$$
exists for $\alpha > 0$ and that the value of the integral is a positive number. The integral
is called the gamma function of $\alpha$, and we write
$$\Gamma(\alpha) = \int_0^{\infty} y^{\alpha - 1} e^{-y}\, dy.$$
If $\alpha = 1$, clearly
$$\Gamma(1) = \int_0^{\infty} e^{-y}\, dy = 1.$$
0 
If $\alpha > 1$, an integration by parts shows that
$$\Gamma(\alpha) = (\alpha - 1)\int_0^{\infty} y^{\alpha - 2} e^{-y}\, dy = (\alpha - 1)\Gamma(\alpha - 1). \tag{3.3.1}$$
Accordingly, if $\alpha$ is a positive integer greater than 1,
$$\Gamma(\alpha) = (\alpha - 1)(\alpha - 2) \cdots (3)(2)(1)\Gamma(1) = (\alpha - 1)!.$$
Since $\Gamma(1) = 1$, this suggests we take $0! = 1$, as we have done. The $\Gamma$ function is
sometimes called the factorial function.
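
The recursion (3.3.1) and the factorial identity are easy to confirm in R, whose built-in gamma() function evaluates $\Gamma$. In the sketch below, the helper gamma_int and the test value a = 4.5 are our own illustrative choices; gamma_int evaluates the defining integral numerically with integrate().

```r
# Numerical version of the defining integral (hypothetical helper name)
gamma_int <- function(a) integrate(function(y) y^(a - 1) * exp(-y), 0, Inf)$value

a <- 4.5                                     # illustrative non-integer argument
vals <- c(integral  = gamma_int(a),
          builtin   = gamma(a),
          recursion = (a - 1) * gamma(a - 1))
print(vals)

# For a positive integer argument, Gamma(alpha) = (alpha - 1)!
print(c(gamma(6), factorial(5)))
```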
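We say that the continuous random variable $X$ has a $\Gamma$-distribution with
parameters $\alpha > 0$ and $\beta > 0$, if its pdf is
$$f(x) = \begin{cases} \dfrac{1}{\Gamma(\alpha)\beta^{\alpha}}\, x^{\alpha - 1} e^{-x/\beta} & 0 < x < \infty \\ 0 & \text{elsewhere.} \end{cases} \tag{3.3.2}$$
In which case, we often write that $X$ has a $\Gamma(\alpha, \beta)$ distribution.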

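To verify that $f(x)$ is a pdf, note first that $f(x) > 0$, for all $x > 0$. To show
that it integrates to 1 over its support, we use the change-of-variable $z = x/\beta$,
$dz = (1/\beta)\,dx$ in the following derivation:
$$\int_0^{\infty} \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\, x^{\alpha - 1} e^{-x/\beta}\, dx = \int_0^{\infty} \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\, (\beta z)^{\alpha - 1} e^{-z} \beta\, dz = \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\, \beta^{\alpha}\, \Gamma(\alpha) = 1;$$
hence, $f(x)$ is a pdf. This change-of-variable is worth remembering. We use a
similar change-of-variable in the following derivation of $X$'s mgf:
$$M(t) = \int_0^{\infty} e^{tx}\, \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\, x^{\alpha - 1} e^{-x/\beta}\, dx = \int_0^{\infty} \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\, x^{\alpha - 1} e^{-x(1 - \beta t)/\beta}\, dx.$$
Next, use the change-of-variable $y = x(1 - \beta t)/\beta$, $t < 1/\beta$, or $x = \beta y/(1 - \beta t)$, to
obtain
$$M(t) = \int_0^{\infty} \frac{\beta/(1 - \beta t)}{\Gamma(\alpha)\beta^{\alpha}} \left(\frac{\beta y}{1 - \beta t}\right)^{\alpha - 1} e^{-y}\, dy.$$
That is,
$$M(t) = \left(\frac{1}{1 - \beta t}\right)^{\alpha} \int_0^{\infty} \frac{1}{\Gamma(\alpha)}\, y^{\alpha - 1} e^{-y}\, dy = \frac{1}{(1 - \beta t)^{\alpha}}, \quad t < \frac{1}{\beta}.$$
Now
$$M'(t) = (-\alpha)(1 - \beta t)^{-\alpha - 1}(-\beta)$$
and
$$M''(t) = (-\alpha)(-\alpha - 1)(1 - \beta t)^{-\alpha - 2}(-\beta)^2.$$
Hence, for a gamma distribution, we have
$$\mu = M'(0) = \alpha\beta$$
and
$$\sigma^2 = M''(0) - \mu^2 = \alpha(\alpha + 1)\beta^2 - \alpha^2\beta^2 = \alpha\beta^2.$$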
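
The formulas $\mu = \alpha\beta$ and $\sigma^2 = \alpha\beta^2$ can be checked by numerical integration against the gamma pdf. This is only a sketch; the parameter values are illustrative choices of ours, and dgamma is called with the shape and scale arguments so that scale corresponds to $\beta$.

```r
alpha <- 5; beta <- 4                               # illustrative parameter values
f <- function(x) dgamma(x, shape = alpha, scale = beta)

m1 <- integrate(function(x) x * f(x), 0, Inf)$value     # E(X)
m2 <- integrate(function(x) x^2 * f(x), 0, Inf)$value   # E(X^2)

# Compare with alpha * beta = 20 and alpha * beta^2 = 80
print(c(mean = m1, variance = m2 - m1^2))
```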


Suppose $X$ has a $\Gamma(\alpha, \beta)$ distribution. To calculate probabilities for this
distribution in R, let a $= \alpha$ and b $= \beta$. Then the command pgamma(x,shape=a,scale=b)
returns $P(X \le x)$, while the value of the pdf of $X$ at $x$ is returned by the command
dgamma(x,shape=a,scale=b).


Example 3.3.1. Let $X$ be the lifetime in hours of a certain battery used under
extremely cold conditions. Suppose $X$ has a $\Gamma(5, 4)$ distribution. Then the mean
lifetime of the battery is 20 hours with standard deviation $\sqrt{5 \times 16} = 8.94$ hours. The
probability that the battery lasts at least 50 hours is 1-pgamma(50,shape=5,scale=4)
= 0.0053. The median lifetime of the battery is qgamma(.5,shape=5,scale=4)
= 18.68 hours. The probability that the lifetime is within one standard deviation
of its mean lifetime is

pgamma(20+8.94,shape=5,scale=4)-pgamma(20-8.94,shape=5,scale=4)=.700.

Finally, this line of R code presents a plot of the pdf:

x=seq(.1,50,.1); plot(dgamma(x,shape=5,scale=4)~x)

On this plot, the reader should locate the above probabilities and the mean and
median lifetimes of the battery. ∎


The main reason for the appeal of the $\Gamma$-distribution in applications is the variety
of shapes of the distribution for different values of $\alpha$ and $\beta$. This is apparent in
Figure 3.3.1 which depicts six $\Gamma$-pdfs.⁴

Suppose $X$ denotes the failure time of a device with pdf $f(x)$ and cdf $F(x)$. In
practice, the pdf of $X$ is often unknown. If a large sample of failure times of these
devices is at hand then estimates of the pdf can be obtained as discussed in Chapter
4. Another function that helps in identifying the pdf of $X$ is the hazard function
of $X$. Let $x$ be in the support of $X$. Suppose the device has not failed at time $x$,
i.e., $X > x$. What is the probability that the device fails in the next instant? We
answer this question in terms of the rate of failure at $x$, which is:
$$r(x) = \lim_{\Delta \to 0} \frac{P(x \le X < x + \Delta \mid X \ge x)}{\Delta} = \frac{1}{1 - F(x)} \lim_{\Delta \to 0} \frac{P(x \le X < x + \Delta)}{\Delta} = \frac{f(x)}{1 - F(x)}. \tag{3.3.3}$$
The rate of failure at time $x$, $r(x)$, is defined as the hazard function of $X$ at $x$.


4The R function for these plots is newfigc3s3.1.R, at the site listed in the Preface. 
Figure 3.3.1: Several gamma densities (vertical axis: f(x))


Note that the hazard function $r(x)$ satisfies $r(x) = -(d/dx) \log[1 - F(x)]$; that is,
$$1 - F(x) = e^{-\int r(x)\, dx + c}, \tag{3.3.4}$$
for a constant $c$. When the support of $X$ is $(0, \infty)$, $F(0) = 0$ serves as a boundary
condition to solve for $c$. In practice, often the scientist can describe the hazard rate
and, hence, $F(x)$ can be determined from expression (3.3.4). For example, suppose
the hazard rate of $X$ is constant; i.e., $r(x) = 1/\beta$ for some $\beta > 0$. Then
$$1 - F(x) = e^{-\int (1/\beta)\, dx + c} = e^{-x/\beta} e^{c}.$$
Since $F(0) = 0$, $e^c = 1$. So the pdf of $X$ is
$$f(x) = \begin{cases} \dfrac{1}{\beta} e^{-x/\beta} & x > 0 \\ 0 & \text{elsewhere.} \end{cases} \tag{3.3.5}$$


Of course, this is a $\Gamma(1, \beta)$ distribution, but it is also called the exponential
distribution with parameter $1/\beta$. An important property of this distribution is given
in Exercise 3.3.25.

Using R, hazard functions can be quickly plotted. Here is the code for an
overlay plot of the hazard functions of the exponential distribution with $\beta = 8$ and
the $\Gamma(4, 2)$-distribution.


x=seq(.1,15,.1); t=dgamma(x,shape=4,scale=2)
b=(1-pgamma(x,shape=4,scale=2)); y1=t/b; plot(y1~x); abline(h=1/8)


Note that the hazard function of this $\Gamma$-distribution is an increasing function of
$x$; i.e., the rate of failure increases as time progresses. Other examples of hazard
functions are given in Exercise 3.3.26.
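
The two claims about these hazard functions (constant for the exponential, increasing for the $\Gamma(4, 2)$) can be checked numerically as well as graphically. The sketch below wraps expression (3.3.3) in a small helper; the name hazard_gamma and the evaluation grid are our own choices.

```r
# Hazard r(x) = f(x) / (1 - F(x)) for a gamma distribution (hypothetical helper)
hazard_gamma <- function(x, shape, scale) {
  dgamma(x, shape = shape, scale = scale) /
    pgamma(x, shape = shape, scale = scale, lower.tail = FALSE)
}

x <- seq(0.5, 15, by = 0.5)
h_exp <- hazard_gamma(x, shape = 1, scale = 8)   # exponential with beta = 8
h_gam <- hazard_gamma(x, shape = 4, scale = 2)   # Gamma(4, 2)

print(range(h_exp))          # both ends equal 1/8: constant hazard
print(all(diff(h_gam) > 0))  # TRUE when the hazard is increasing on the grid
```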
One of the most important properties of the gamma distribution is its additive
property.

Theorem 3.3.1. Let $X_1, \ldots, X_n$ be independent random variables. Suppose, for
$i = 1, \ldots, n$, that $X_i$ has a $\Gamma(\alpha_i, \beta)$ distribution. Let $Y = \sum_{i=1}^{n} X_i$. Then $Y$ has a
$\Gamma(\sum_{i=1}^{n} \alpha_i, \beta)$ distribution.


Proof: Using the assumed independence and the mgf of a gamma distribution, we
have by Theorem 2.6.1 that for $t < 1/\beta$,
$$M_Y(t) = \prod_{i=1}^{n} (1 - \beta t)^{-\alpha_i} = (1 - \beta t)^{-\sum_{i=1}^{n} \alpha_i},$$
which is the mgf of a $\Gamma(\sum_{i=1}^{n} \alpha_i, \beta)$ distribution. ∎
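
For $n = 2$ the theorem can be checked directly from the convolution formula for the pdf of a sum of independent random variables. The following sketch uses illustrative parameter values of our own (a1, a2, beta) and compares the numerically integrated convolution with the $\Gamma(\alpha_1 + \alpha_2, \beta)$ pdf at a few points.

```r
a1 <- 2; a2 <- 3.5; beta <- 1.5            # illustrative parameters, common scale beta

# pdf of X1 + X2 at y: integral over x of f1(x) f2(y - x)
conv_pdf <- function(y) {
  integrate(function(x) dgamma(x, shape = a1, scale = beta) *
                        dgamma(y - x, shape = a2, scale = beta),
            lower = 0, upper = y)$value
}

y <- c(1, 4, 8, 15)
conv  <- sapply(y, conv_pdf)
exact <- dgamma(y, shape = a1 + a2, scale = beta)
print(cbind(y, convolution = conv, gamma_pdf = exact))
```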
$\Gamma$-distributions also occur naturally in the Poisson process.


Remark 3.3.1 (Poisson Processes). For $t > 0$, let $X_t$ denote the number of events
of interest that occur in the interval $(0, t]$. Assume $X_t$ satisfies the three assumptions
of a Poisson process. Let $k$ be a fixed positive integer and define the continuous
random variable $W_k$ to be the waiting time until the $k$th event occurs. Then the
range of $W_k$ is $(0, \infty)$. Note that for $w > 0$, $W_k > w$ if and only if $X_w \le k - 1$.
Hence,
$$P(W_k > w) = \sum_{x=0}^{k-1} P(X_w = x) = \sum_{x=0}^{k-1} \frac{(\lambda w)^x e^{-\lambda w}}{x!}.$$
In Exercise 3.3.5, the reader is asked to prove that
$$\int_{\lambda w}^{\infty} \frac{z^{k-1} e^{-z}}{(k - 1)!}\, dz = \sum_{x=0}^{k-1} \frac{(\lambda w)^x e^{-\lambda w}}{x!}.$$
Accepting this result, we have, for $w > 0$, that the cdf of $W_k$ satisfies
$$F_{W_k}(w) = 1 - \int_{\lambda w}^{\infty} \frac{z^{k-1} e^{-z}}{\Gamma(k)}\, dz = \int_0^{\lambda w} \frac{z^{k-1} e^{-z}}{\Gamma(k)}\, dz,$$
and for $w \le 0$, $F_{W_k}(w) = 0$. If we change the variable of integration in the integral
that defines $F_{W_k}(w)$ by writing $z = \lambda y$, then
$$F_{W_k}(w) = \int_0^{w} \frac{\lambda^k y^{k-1} e^{-\lambda y}}{\Gamma(k)}\, dy, \quad w > 0,$$
and $F_{W_k}(w) = 0$ for $w \le 0$. Accordingly, the pdf of $W_k$ is
$$f_{W_k}(w) = F'_{W_k}(w) = \begin{cases} \dfrac{\lambda^k w^{k-1} e^{-\lambda w}}{\Gamma(k)} & 0 < w < \infty \\ 0 & \text{elsewhere.} \end{cases}$$
That is, the waiting time until the $k$th event, $W_k$, has the gamma distribution with
$\alpha = k$ and $\beta = 1/\lambda$. Let $T_1$ be the waiting time until the first event occurs, i.e.,
$k = 1$. Then the pdf of $T_1$ is
$$f_{T_1}(w) = \begin{cases} \lambda e^{-\lambda w} & 0 < w < \infty \\ 0 & \text{elsewhere.} \end{cases} \tag{3.3.6}$$
Hence, $T_1$ has the $\Gamma(1, 1/\lambda)$-distribution. The mean of $T_1$ is $1/\lambda$, while the mean
of $X_1$ is $\lambda$. Thus, we expect $\lambda$ events to occur in a unit of time and we expect the
first event to occur at time $1/\lambda$.

Continuing in this way, for $i \ge 1$, let $T_i$ denote the interarrival time of the $i$th
event; i.e., $T_i$ is the time between the occurrence of event $(i - 1)$ and event $i$. As
shown, $T_1$ has the $\Gamma(1, 1/\lambda)$-distribution. Note that Axioms (1) and (2) of the Poisson
process only depend on $\lambda$ and the length of the interval; in particular, they do not depend
on the endpoints of the interval. Further, occurrences in nonoverlapping intervals
are independent of one another. Hence, using the same reasoning as above, $T_j$,
$j \ge 2$, also has the $\Gamma(1, 1/\lambda)$-distribution. Furthermore, $T_1, T_2, T_3, \ldots$ are
independent. Note the waiting time until the $k$th event satisfies $W_k = T_1 + \cdots + T_k$.
Thus by Theorem 3.3.1, $W_k$ has a $\Gamma(k, 1/\lambda)$ distribution, confirming the derivation
above. Although this discussion has been intuitive, it can be made rigorous; see,
for example, Parzen (1962). ∎
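
The key identity of the remark, that $W_k > w$ exactly when $X_w \le k - 1$, translates into a one-line check in R: the Poisson cdf at $k - 1$ with mean $\lambda w$ should equal the upper tail of the $\Gamma(k, 1/\lambda)$ distribution at $w$. The values of lambda, k, and w below are illustrative choices of ours.

```r
lambda <- 2; k <- 3                       # illustrative rate and event index
w <- c(0.5, 1, 2, 4)

p_count <- ppois(k - 1, lambda * w)                                    # P(X_w <= k - 1)
p_wait  <- pgamma(w, shape = k, scale = 1 / lambda, lower.tail = FALSE) # P(W_k > w)

print(cbind(w, p_count, p_wait))
```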


3.3.1 The $\chi^2$-Distribution


Let us now consider a special case of the gamma distribution in which $\alpha = r/2$,
where $r$ is a positive integer, and $\beta = 2$. A random variable $X$ of the continuous
type that has the pdf
$$f(x) = \begin{cases} \dfrac{1}{\Gamma(r/2)2^{r/2}}\, x^{r/2 - 1} e^{-x/2} & 0 < x < \infty \\ 0 & \text{elsewhere,} \end{cases} \tag{3.3.7}$$
and the mgf
$$M(t) = (1 - 2t)^{-r/2}, \quad t < \tfrac{1}{2},$$
is said to have a chi-square distribution ($\chi^2$-distribution), and any $f(x)$ of this
form is called a chi-square pdf. The mean and the variance of a chi-square
distribution are $\mu = \alpha\beta = (r/2)2 = r$ and $\sigma^2 = \alpha\beta^2 = (r/2)2^2 = 2r$, respectively. We
call the parameter $r$ the number of degrees of freedom of the chi-square distribution
(or of the chi-square pdf). Because the chi-square distribution has an important role
in statistics and occurs so frequently, we write, for brevity, that $X$ is $\chi^2(r)$ to mean
that the random variable $X$ has a chi-square distribution with $r$ degrees of freedom.
The R function pchisq(x,r) returns $P(X \le x)$ and the command dchisq(x,r)
returns the value of the pdf of $X$ at $x$ when $X$ has a chi-square distribution with
$r$ degrees of freedom.


Example 3.3.2. Suppose $X$ has a $\chi^2$-distribution with 10 degrees of freedom. Then
the mean of $X$ is 10 and its standard deviation is $\sqrt{20} = 4.47$. Using R, its median
and quartiles are qchisq(c(.25,.5,.75),10) = (6.74, 9.34, 12.55). The following
command plots the density function over the interval (0, 24):

x=seq(0,24,.1); plot(dchisq(x,10)~x)

Run this line of code and locate the mean, quartiles, and median of $X$ on the
plot. ∎


Example 3.3.3. The quantiles of the $\chi^2$-distribution are frequently used in
statistics. Before the advent of modern computation, tables of these quantiles were
compiled. Table I in Appendix D offers a typical $\chi^2$-table of the quantiles for the
probabilities 0.01, 0.025, 0.05, 0.1, 0.9, 0.95, 0.975, 0.99 and degrees of freedom $1, 2, \ldots, 30$.
As discussed, the R function qchisq easily computes these quantiles. Actually, the
following two lines of R code perform the computation of Table I.


rs=1:30; ps=c(.01,.025,.05,.1,.9,.95,.975,.99);
for(r in rs){print(c(r,round(qchisq(ps,r),digits=3)))}


Note that the code rounds the critical values to 3 places. ∎


The following result is used several times in the sequel; hence, we record it as a 
theorem. 


Theorem 3.3.2. Let $X$ have a $\chi^2(r)$ distribution. If $k > -r/2$, then $E(X^k)$ exists
and it is given by
$$E(X^k) = \frac{2^k\, \Gamma\!\left(\frac{r}{2} + k\right)}{\Gamma\!\left(\frac{r}{2}\right)}, \quad \text{if } k > -r/2. \tag{3.3.8}$$

Proof: Note that
$$E(X^k) = \int_0^{\infty} \frac{1}{\Gamma\!\left(\frac{r}{2}\right) 2^{r/2}}\, x^{(r/2) + k - 1} e^{-x/2}\, dx.$$
Make the change of variable $u = x/2$ in the above integral. This results in
$$E(X^k) = \int_0^{\infty} \frac{1}{\Gamma\!\left(\frac{r}{2}\right) 2^{(r/2) - 1}}\, 2^{(r/2) + k - 1} u^{(r/2) + k - 1} e^{-u}\, du.$$
This simplifies to the desired result provided that $k > -(r/2)$. ∎


Notice that if $k$ is a nonnegative integer, then $k > -(r/2)$ is always true. Hence,
all moments of a $\chi^2$ distribution exist and the $k$th moment is given by (3.3.8).
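
Formula (3.3.8) is simple to evaluate in R. The sketch below uses a helper of our own naming, chisq_moment, to recover the mean $r$ and variance $2r$ from the first two moments, and to check a negative moment ($k = -1$, which requires $r > 2$) against numerical integration. The choice r = 10 is illustrative.

```r
# k-th moment of a chi-square(r) variable from (3.3.8); requires k > -r/2
chisq_moment <- function(k, r) 2^k * gamma(r / 2 + k) / gamma(r / 2)

r <- 10                                            # illustrative degrees of freedom
m1 <- chisq_moment(1, r)                           # mean, equals r
v  <- chisq_moment(2, r) - m1^2                    # variance, equals 2r
print(c(mean = m1, variance = v))

# Negative moment E(1/X), compared with numerical integration
num <- integrate(function(x) x^(-1) * dchisq(x, r), 0, Inf)$value
print(c(numeric = num, formula = chisq_moment(-1, r)))
```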


Example 3.3.4. Let $X$ have a gamma distribution with $\alpha = r/2$, where $r$ is a
positive integer, and $\beta > 0$. Define the random variable $Y = 2X/\beta$. We seek the
pdf of $Y$. Now the mgf of $Y$ is
$$M_Y(t) = E\left(e^{tY}\right) = E\left[e^{(2t/\beta)X}\right] = \left[1 - \frac{2t}{\beta}\beta\right]^{-r/2} = [1 - 2t]^{-r/2},$$
which is the mgf of a $\chi^2$-distribution with $r$ degrees of freedom. That is, $Y$ is $\chi^2(r)$.
∎
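
The conclusion of this example can be confirmed through the cdfs: since $P(Y \le y) = P(X \le \beta y/2)$, the gamma cdf evaluated at $\beta y/2$ should agree with the $\chi^2(r)$ cdf at $y$. The values of r, beta, and y below are our own illustrative choices.

```r
r <- 7; beta <- 3                          # illustrative degrees of freedom and scale
y <- c(1, 5, 10, 20)

p_via_gamma <- pgamma(beta * y / 2, shape = r / 2, scale = beta)  # P(X <= beta * y / 2)
p_chisq     <- pchisq(y, df = r)                                  # P(Y <= y) if Y is chi-square(r)

print(cbind(y, p_via_gamma, p_chisq))
```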
Because the $\chi^2$-distributions are a subfamily of the $\Gamma$-distributions, the additivity
property for $\Gamma$-distributions given by Theorem 3.3.1 holds for $\chi^2$-distributions,
also. Since we often make use of this property, we state it as a corollary for easy
reference.


Corollary 3.3.1. Let $X_1, \ldots, X_n$ be independent random variables. Suppose, for
$i = 1, \ldots, n$, that $X_i$ has a $\chi^2(r_i)$ distribution. Let $Y = \sum_{i=1}^{n} X_i$. Then $Y$ has a
$\chi^2(\sum_{i=1}^{n} r_i)$ distribution.


3.3.2 The (-Distribution 


As we have discussed, in terms of modeling, the $\Gamma$-distributions offer a wide variety of shapes for skewed distributions with support $(0, \infty)$. In the exercises and later chapters, we offer other such families of distributions. How about continuous distributions whose support is a bounded interval in $R$? For example, suppose the support of $X$ is $(a, b)$, where $-\infty < a < b < \infty$ and $a$ and $b$ are known. Without loss of generality, for discussion, we can assume that $a = 0$ and $b = 1$, since, if not, we could consider the random variable $Y = (X - a)/(b - a)$. In this section, we discuss the $\beta$-distribution whose family offers a wide variety of shapes for distributions with support on bounded intervals.

One way of defining the $\beta$-family of distributions is to derive it from a pair of independent $\Gamma$ random variables. Let $X_1$ and $X_2$ be two independent random variables that have $\Gamma$ distributions and the joint pdf

$$h(x_1, x_2) = \frac{1}{\Gamma(\alpha)\Gamma(\beta)}\, x_1^{\alpha-1} x_2^{\beta-1} e^{-x_1 - x_2}, \quad 0 < x_1 < \infty, \; 0 < x_2 < \infty,$$

zero elsewhere, where $\alpha > 0$, $\beta > 0$. Let $Y_1 = X_1 + X_2$ and $Y_2 = X_1/(X_1 + X_2)$. We next show that $Y_1$ and $Y_2$ are independent.

The space $\mathcal{S}$ is, exclusive of the points on the coordinate axes, the first quadrant of the $x_1x_2$-plane. Now

$$y_1 = u_1(x_1, x_2) = x_1 + x_2, \qquad y_2 = u_2(x_1, x_2) = \frac{x_1}{x_1 + x_2}$$

may be written $x_1 = y_1 y_2$, $x_2 = y_1(1 - y_2)$, so

$$J = \begin{vmatrix} y_2 & y_1 \\ 1 - y_2 & -y_1 \end{vmatrix} = -y_1 \not\equiv 0.$$


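The transformation is one-to-one, and it maps $\mathcal{S}$ onto $\mathcal{T} = \{(y_1, y_2) : 0 < y_1 < \infty, \; 0 < y_2 < 1\}$ in the $y_1y_2$-plane. The joint pdf of $Y_1$ and $Y_2$ on its support is

$$g(y_1, y_2) = (y_1)\frac{1}{\Gamma(\alpha)\Gamma(\beta)} (y_1 y_2)^{\alpha-1} [y_1(1-y_2)]^{\beta-1} e^{-y_1} = \begin{cases} \dfrac{y_2^{\alpha-1}(1-y_2)^{\beta-1}}{\Gamma(\alpha)\Gamma(\beta)}\, y_1^{\alpha+\beta-1} e^{-y_1} & 0 < y_1 < \infty, \; 0 < y_2 < 1 \\ 0 & \text{elsewhere.} \end{cases}$$

In accordance with Theorem 2.4.1 the random variables are independent. The marginal pdf of $Y_2$ is

$$g_2(y_2) = \frac{y_2^{\alpha-1}(1-y_2)^{\beta-1}}{\Gamma(\alpha)\Gamma(\beta)} \int_0^\infty y_1^{\alpha+\beta-1} e^{-y_1}\, dy_1 = \begin{cases} \dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, y_2^{\alpha-1}(1-y_2)^{\beta-1} & 0 < y_2 < 1 \\ 0 & \text{elsewhere.} \end{cases} \tag{3.3.9}$$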


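This pdf is that of the beta distribution with parameters $\alpha$ and $\beta$. Since $g(y_1, y_2) \equiv g_1(y_1) g_2(y_2)$, it must be that the pdf of $Y_1$ is

$$g_1(y_1) = \begin{cases} \dfrac{1}{\Gamma(\alpha+\beta)}\, y_1^{\alpha+\beta-1} e^{-y_1} & 0 < y_1 < \infty \\ 0 & \text{elsewhere,} \end{cases}$$

which is that of a gamma distribution with parameter values of $\alpha + \beta$ and 1.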

It is an easy exercise to show that the mean and the variance of $Y_2$, which has a beta distribution with parameters $\alpha$ and $\beta$, are, respectively,

$$\mu = \frac{\alpha}{\alpha+\beta}, \qquad \sigma^2 = \frac{\alpha\beta}{(\alpha+\beta+1)(\alpha+\beta)^2}.$$

The package R calculates probabilities for the beta distribution. If $X$ has a beta distribution with parameters $\alpha = a$ and $\beta = b$, then the command pbeta(x,a,b) returns $P(X \le x)$ and the command dbeta(x,a,b) returns the value of the pdf of $X$ at $x$.
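The mean and variance formulas can be compared with numerical integrals of the pdf returned by dbeta. This is a rough sketch under illustrative parameter values ($\alpha = 2$, $\beta = 5$ are arbitrary choices, not taken from the text).

```r
alpha <- 2; beta <- 5                   # illustrative parameter values

# Closed-form mean and variance of the beta distribution.
m_formula <- alpha / (alpha + beta)
v_formula <- alpha * beta / ((alpha + beta + 1) * (alpha + beta)^2)

# The same quantities by numerical integration against dbeta on (0, 1).
m_num <- integrate(function(y) y   * dbeta(y, alpha, beta), 0, 1)$value
e_y2  <- integrate(function(y) y^2 * dbeta(y, alpha, beta), 0, 1)$value
v_num <- e_y2 - m_num^2

# pbeta is the integral of dbeta: compare the cdf at 0.3 with a direct integral.
cdf_direct <- integrate(function(y) dbeta(y, alpha, beta), 0, 0.3)$value
cdf_pbeta  <- pbeta(0.3, alpha, beta)
```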


Example 3.3.5 (Shapes of $\beta$-Distributions). The following 3 lines of R code⁵ will obtain a 4 × 4 page of plots of $\beta$ pdfs for all combinations of integer values of $\alpha$ and $\beta$ between 2 and 5. Those distributions on the main diagonal of the page of plots are symmetric, those below the main diagonal are left-skewed, and those above the main diagonal are right-skewed.


par(mfrow=c(4,4)); r1=2:5; r2=2:5; x=seq(.01,.99,.01)
for(a in r1){for(b in r2){plot(dbeta(x,a,b)~x);
title(paste("alpha = ",a,"beta = ",b))}}


Note that if $\alpha = \beta = 1$, then the $\beta$-distribution is the uniform distribution with support $(0, 1)$.

We close this section with another example of a random variable whose distri- 
bution is derived from a transformation of gamma random variables. 


Example 3.3.6 (Dirichlet Distribution). Let $X_1, X_2, \ldots, X_{k+1}$ be independent random variables, each having a gamma distribution with $\beta = 1$. The joint pdf of these variables may be written as

$$h(x_1, x_2, \ldots, x_{k+1}) = \begin{cases} \prod_{i=1}^{k+1} \dfrac{1}{\Gamma(\alpha_i)}\, x_i^{\alpha_i - 1} e^{-x_i} & 0 < x_i < \infty \\ 0 & \text{elsewhere.} \end{cases}$$


⁵ Download the R function betaplts at the site listed in the Preface.

Let

$$Y_i = \frac{X_i}{X_1 + X_2 + \cdots + X_{k+1}}, \quad i = 1, 2, \ldots, k,$$

and $Y_{k+1} = X_1 + X_2 + \cdots + X_{k+1}$ denote $k+1$ new random variables. The associated transformation maps $\mathcal{A} = \{(x_1, \ldots, x_{k+1}) : 0 < x_i < \infty, \; i = 1, \ldots, k+1\}$ onto the space:

$$\mathcal{B} = \{(y_1, \ldots, y_k, y_{k+1}) : 0 < y_i, \; i = 1, \ldots, k, \; y_1 + \cdots + y_k < 1, \; 0 < y_{k+1} < \infty\}.$$


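The single-valued inverse functions are $x_1 = y_1 y_{k+1}, \ldots, x_k = y_k y_{k+1}$, $x_{k+1} = y_{k+1}(1 - y_1 - \cdots - y_k)$, so that the Jacobian is

$$J = \begin{vmatrix} y_{k+1} & 0 & \cdots & 0 & y_1 \\ 0 & y_{k+1} & \cdots & 0 & y_2 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \cdots & y_{k+1} & y_k \\ -y_{k+1} & -y_{k+1} & \cdots & -y_{k+1} & (1 - y_1 - \cdots - y_k) \end{vmatrix} = y_{k+1}^k.$$

Hence the joint pdf of $Y_1, \ldots, Y_k, Y_{k+1}$ is given by

$$\frac{y_{k+1}^{\alpha_1 + \cdots + \alpha_{k+1} - 1}\, y_1^{\alpha_1 - 1} \cdots y_k^{\alpha_k - 1} (1 - y_1 - \cdots - y_k)^{\alpha_{k+1} - 1} e^{-y_{k+1}}}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_k)\Gamma(\alpha_{k+1})},$$

provided that $(y_1, \ldots, y_k, y_{k+1}) \in \mathcal{B}$ and is equal to zero elsewhere. By integrating out $y_{k+1}$, the joint pdf of $Y_1, \ldots, Y_k$ is seen to be

$$g(y_1, \ldots, y_k) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_{k+1})}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_{k+1})}\, y_1^{\alpha_1 - 1} \cdots y_k^{\alpha_k - 1} (1 - y_1 - \cdots - y_k)^{\alpha_{k+1} - 1}, \tag{3.3.10}$$

when $0 < y_i$, $i = 1, \ldots, k$, $y_1 + \cdots + y_k < 1$, while the function $g$ is equal to zero elsewhere. Random variables $Y_1, \ldots, Y_k$ that have a joint pdf of this form are said to have a Dirichlet pdf. It is seen, in the special case of $k = 1$, that the Dirichlet pdf becomes a beta pdf. Moreover, it is also clear from the joint pdf of $Y_1, \ldots, Y_k, Y_{k+1}$ that $Y_{k+1}$ has a gamma distribution with parameters $\alpha_1 + \cdots + \alpha_k + \alpha_{k+1}$ and $\beta = 1$ and that $Y_{k+1}$ is independent of $Y_1, Y_2, \ldots, Y_k$.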
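The construction in this example also gives a simple way to simulate Dirichlet variables: draw independent gamma variables and divide by their sum. The R sketch below is an illustration only; the parameter vector, sample size, and seed are arbitrary assumptions. It checks that the simulated $Y_i$ have means close to $\alpha_i/(\alpha_1 + \cdots + \alpha_{k+1})$ and that the sum $Y_{k+1}$ is nearly uncorrelated with them, as independence requires.

```r
set.seed(1)                             # arbitrary seed, for reproducibility
alpha <- c(2, 3, 4)                     # alpha_1, ..., alpha_{k+1} with k = 2
n <- 20000                              # number of simulated vectors

# Column i holds n independent draws of X_i ~ gamma(alpha_i, beta = 1).
x <- sapply(alpha, function(a) rgamma(n, shape = a, scale = 1))

total <- rowSums(x)                     # Y_{k+1} = X_1 + ... + X_{k+1}
y <- x[, -length(alpha), drop = FALSE] / total   # Y_1, ..., Y_k (row-wise division)

sim_means    <- colMeans(y)             # simulated means of Y_1, ..., Y_k
theory_means <- alpha[-length(alpha)] / sum(alpha)
sum_mean     <- mean(total)             # should be near sum(alpha)
corr_with_sum <- cor(y, total)          # should be near zero
```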


EXERCISES 
3.3.1. Suppose $(1 - 2t)^{-6}$, $t < \frac{1}{2}$, is the mgf of the random variable $X$.
(a) Use R to compute $P(X < 5.23)$.
(b) Find the mean $\mu$ and variance $\sigma^2$ of $X$. Use R to compute $P(|X - \mu| < 2\sigma)$.


3.3.2. If $X$ is $\chi^2(5)$, determine the constants $c$ and $d$ so that $P(c < X < d) = 0.95$ and $P(X < c) = 0.025$.

3.3.3. Suppose the lifetime in months of an engine, working under hazardous conditions, has a $\Gamma$ distribution with a mean of 10 months and a variance of 20 months squared.


(a) Determine the median lifetime of an engine. 


(b) Suppose such an engine is termed successful if its lifetime exceeds 15 months. 
In a sample of 10 engines, determine the probability of at least 3 successful 
engines. 


3.3.4. Let $X$ be a random variable such that $E(X^m) = (m+1)!\,2^m$, $m = 1, 2, 3, \ldots$. Determine the mgf and the distribution of $X$.

Hint: Write out the Taylor series⁶ of the mgf.

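3.3.5. Show that

$$\int_\mu^\infty \frac{1}{\Gamma(k)}\, z^{k-1} e^{-z}\, dz = \sum_{x=0}^{k-1} \frac{\mu^x e^{-\mu}}{x!}, \quad k = 1, 2, 3, \ldots.$$

This demonstrates the relationship between the cdfs of the gamma and Poisson distributions.
Hint: Either integrate by parts $k - 1$ times or obtain the “antiderivative” by showing that

$$\frac{d}{dz}\left[-e^{-z} \sum_{j=0}^{k-1} \frac{\Gamma(k)}{(k-j-1)!}\, z^{k-j-1}\right] = z^{k-1} e^{-z}.$$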


3.3.6. Let $X_1$, $X_2$, and $X_3$ be iid random variables, each with pdf $f(x) = e^{-x}$, $0 < x < \infty$, zero elsewhere.


(a) Find the distribution of $Y = \text{minimum}(X_1, X_2, X_3)$.
Hint: $P(Y \le y) = 1 - P(Y > y) = 1 - P(X_i > y, \; i = 1, 2, 3)$.


(b) Find the distribution of $Y = \text{maximum}(X_1, X_2, X_3)$.


3.3.7. Let $X$ have a gamma distribution with pdf

$$f(x) = \frac{1}{\beta^2}\, x e^{-x/\beta}, \quad 0 < x < \infty,$$

zero elsewhere. If $x = 2$ is the unique mode of the distribution, find the parameter $\beta$ and $P(X < 9.49)$.


3.3.8. Compute the measures of skewness and kurtosis of a gamma distribution that has parameters $\alpha$ and $\beta$.


3.3.9. Let $X$ have a gamma distribution with parameters $\alpha$ and $\beta$. Show that $P(X \ge 2\alpha\beta) \le (2/e)^\alpha$.
Hint: Use the result of Exercise 1.10.5.


⁶ See, for example, the discussion on Taylor series in Mathematical Comments referenced in the Preface.

3.3.10. Give a reasonable definition of a chi-square distribution with zero degrees of freedom.
Hint: Work with the mgf of a distribution that is $\chi^2(r)$ and let $r = 0$.


3.3.11. Using the computer, obtain plots of the pdfs of chi-squared distributions 
with degrees of freedom r = 1,2,5,10,20. Comment on the plots. 


3.3.12. Using the computer, plot the cdf of a $\Gamma(5, 4)$ distribution and use it to guess the median. Confirm it with a computer command that returns the median [In R, use the command qgamma(.5,shape=5,scale=4)].


3.3.13. Using the computer, obtain plots of beta pdfs for $\alpha = 1, 5, 10$ and $\beta = 1, 2, 5, 10, 20$.


3.3.14. In a warehouse of parts for a large mill, the average time between requests 
for parts is about 10 minutes. 


(a) Find the probability that in an hour there will be at least 10 requests for 
parts. 


(b) Find the probability that the 10th request in the morning requires at least 2 
hours of waiting time. 


3.3.15. Let $X$ have a Poisson distribution with parameter $m$. If $m$ is an experimental value of a random variable having a gamma distribution with $\alpha = 2$ and $\beta = 1$, compute $P(X = 0, 1, 2)$.

Hint: Find an expression that represents the joint distribution of X and m. Then 
integrate out m to find the marginal distribution of X. 


3.3.16. Let $X$ have the uniform distribution with pdf $f(x) = 1$, $0 < x < 1$, zero elsewhere. Find the cdf of $Y = -2\log X$. What is the pdf of $Y$?


3.3.17. Find the uniform distribution of the continuous type on the interval $(b, c)$ that has the same mean and the same variance as those of a chi-square distribution with 8 degrees of freedom. That is, find $b$ and $c$.


3.3.18. Find the mean and variance of the $\beta$ distribution.
Hint: From the pdf, we know that

$$\int_0^1 y^{\alpha-1}(1-y)^{\beta-1}\, dy = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$

for all $\alpha > 0$, $\beta > 0$.


3.3.19. Determine the constant $c$ in each of the following so that each $f(x)$ is a $\beta$ pdf:

(a) $f(x) = cx(1 - x)^3$, $0 < x < 1$, zero elsewhere.

(b) $f(x) = cx^4(1 - x)^5$, $0 < x < 1$, zero elsewhere.

(c) $f(x) = cx^2(1 - x)^8$, $0 < x < 1$, zero elsewhere.

3.3.20. Determine the constant $c$ so that $f(x) = cx(3 - x)^4$, $0 < x < 3$, zero elsewhere, is a pdf.


3.3.21. Show that the graph of the $\beta$ pdf is symmetric about the vertical line through $x = \frac{1}{2}$ if $\alpha = \beta$.


3.3.22. Show, for $k = 1, 2, \ldots, n$, that

$$\int_p^1 \frac{n!}{(k-1)!(n-k)!}\, z^{k-1}(1-z)^{n-k}\, dz = \sum_{x=0}^{k-1} \binom{n}{x} p^x (1-p)^{n-x}.$$

This demonstrates the relationship between the cdfs of the $\beta$ and binomial distributions.


3.3.23. Let $X_1$ and $X_2$ be independent random variables. Let $X_1$ and $Y = X_1 + X_2$ have chi-square distributions with $r_1$ and $r$ degrees of freedom, respectively. Here $r_1 < r$. Show that $X_2$ has a chi-square distribution with $r - r_1$ degrees of freedom.
Hint: Write $M(t) = E(e^{t(X_1 + X_2)})$ and make use of the independence of $X_1$ and $X_2$.


3.3.24. Let $X_1$, $X_2$ be two independent random variables having gamma distributions with parameters $\alpha_1 = 3$, $\beta_1 = 3$ and $\alpha_2 = 5$, $\beta_2 = 1$, respectively.

(a) Find the mgf of $Y = 2X_1 + 6X_2$.

(b) What is the distribution of $Y$?

3.3.25. Let $X$ have an exponential distribution.

(a) For $x > 0$ and $y > 0$, show that

$$P(X > x + y \mid X > x) = P(X > y). \tag{3.3.11}$$

Hence, the exponential distribution has the memoryless property. Recall from Exercise 3.1.9 that the discrete geometric distribution has a similar property.


(b) Let $F(y)$ be the cdf of a continuous random variable $Y$. Assume that $F(0) = 0$ and $0 < F(y) < 1$ for $y > 0$. Suppose property (3.3.11) holds for $Y$. Show that $F_Y(y) = 1 - e^{-\lambda y}$ for $y > 0$.


Hint: Show that $g(y) = 1 - F_Y(y)$ satisfies the equation

$$g(y + z) = g(y)g(z).$$


3.3.26. Let $X$ denote time until failure of a device and let $r(x)$ denote the hazard function of $X$.


(a) If $r(x) = cx^b$, where $c$ and $b$ are positive constants, show that $X$ has a Weibull distribution; i.e.,

$$f(x) = \begin{cases} cx^b \exp\left\{-\dfrac{cx^{b+1}}{b+1}\right\} & 0 < x < \infty \\ 0 & \text{elsewhere.} \end{cases} \tag{3.3.12}$$

(b) If $r(x) = ce^{bx}$, where $c$ and $b$ are positive constants, show that $X$ has a Gompertz cdf given by

$$F(x) = \begin{cases} 1 - \exp\left\{\dfrac{c}{b}\left(1 - e^{bx}\right)\right\} & 0 < x < \infty \\ 0 & \text{elsewhere.} \end{cases} \tag{3.3.13}$$

This is frequently used by actuaries as a distribution of the length of human life.

(c) If $r(x) = bx$, linear hazard rate, show that the pdf of $X$ is

$$f(x) = \begin{cases} bx\, e^{-bx^2/2} & 0 < x < \infty \\ 0 & \text{elsewhere.} \end{cases} \tag{3.3.14}$$

This pdf is called the Rayleigh pdf.


3.3.27. Write an R function that returns the value $f(x)$ for a specified $x$ when $f(x)$ is the Weibull pdf given in expression (3.3.12). Next write an R function that returns the associated hazard function $r(x)$. Obtain side-by-side plots of the pdf and hazard function for the three cases: $c = 5$ and $b = 0.5$; $c = 5$ and $b = 1.0$; and $c = 5$ and $b = 1.5$.


3.3.28. In Example 3.3.5, a page of plots of $\beta$ pdfs was discussed. All of these pdfs are mound shaped. Obtain a page of plots for all combinations of $\alpha$ and $\beta$ drawn from the set $\{.25, .75, 1, 1.25\}$. Comment on these shapes.


3.3.29. Let $Y_1, \ldots, Y_k$ have a Dirichlet distribution with parameters $\alpha_1, \ldots, \alpha_k, \alpha_{k+1}$.


(a) Show that $Y_1$ has a beta distribution with parameters $\alpha = \alpha_1$ and $\beta = \alpha_2 + \cdots + \alpha_{k+1}$.


(b) Show that $Y_1 + \cdots + Y_r$, $r \le k$, has a beta distribution with parameters $\alpha = \alpha_1 + \cdots + \alpha_r$ and $\beta = \alpha_{r+1} + \cdots + \alpha_{k+1}$.


(c) Show that $Y_1 + Y_2$, $Y_3 + Y_4$, $Y_5, \ldots, Y_k$, $k \ge 5$, have a Dirichlet distribution with parameters $\alpha_1 + \alpha_2$, $\alpha_3 + \alpha_4$, $\alpha_5, \ldots, \alpha_k, \alpha_{k+1}$.
Hint: Recall the definition of $Y_i$ in Example 3.3.6 and use the fact that the sum of several independent gamma variables with $\beta = 1$ is a gamma variable.


3.4 The Normal Distribution 


Motivation for the normal distribution is found in the Central Limit Theorem, which 
is presented in Section 5.3. This theorem shows that normal distributions provide 
an important family of distributions for applications and for statistical inference, 
in general. We proceed by first introducing the standard normal distribution and 
through it the general normal distribution. 


Consider the integral

$$I = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(\frac{-z^2}{2}\right) dz. \tag{3.4.1}$$

This integral exists because the integrand is a positive continuous function that is bounded by an integrable function; that is,

$$0 < \exp\left(\frac{-z^2}{2}\right) < \exp(-|z| + 1), \quad -\infty < z < \infty,$$

and

$$\int_{-\infty}^{\infty} \exp(-|z| + 1)\, dz = 2e.$$


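To evaluate the integral $I$, we note that $I > 0$ and that $I^2$ may be written

$$I^2 = \frac{1}{2\pi} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \exp\left(-\frac{z^2 + w^2}{2}\right) dz\, dw.$$

This iterated integral can be evaluated by changing to polar coordinates. If we set $z = r\cos\theta$ and $w = r\sin\theta$, we have

$$I^2 = \frac{1}{2\pi} \int_0^{2\pi} \int_0^{\infty} e^{-r^2/2}\, r\, dr\, d\theta = \frac{1}{2\pi} \int_0^{2\pi} d\theta = 1.$$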


Because the integrand of display (3.4.1) is positive on $R$ and integrates to 1 over $R$, it is a pdf of a continuous random variable with support $R$. We denote this random variable by $Z$. In summary, $Z$ has the pdf

$$f(z) = \frac{1}{\sqrt{2\pi}} \exp\left(\frac{-z^2}{2}\right), \quad -\infty < z < \infty. \tag{3.4.2}$$
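For $t \in R$, the mgf of $Z$ can be derived by a completion of a square as follows:

$$\begin{aligned} E[\exp\{tZ\}] &= \int_{-\infty}^{\infty} \exp\{tz\} \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{1}{2}z^2\right\} dz \\ &= \exp\left\{\frac{1}{2}t^2\right\} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{1}{2}(z - t)^2\right\} dz \\ &= \exp\left\{\frac{1}{2}t^2\right\} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{1}{2}w^2\right\} dw, \end{aligned} \tag{3.4.3}$$

where for the last integral we made the one-to-one change of variable $w = z - t$. By the identity (3.4.2), the integral in expression (3.4.3) has value 1. Thus the mgf of $Z$ is

$$M_Z(t) = \exp\left\{\frac{1}{2}t^2\right\}, \quad \text{for } -\infty < t < \infty. \tag{3.4.4}$$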
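The closed form (3.4.4) can be compared with a direct numerical evaluation of $E[\exp\{tZ\}]$. The following R sketch is only an informal check; mgf_z_num is a hypothetical helper name and the values of $t$ are arbitrary.

```r
# Numerical mgf of Z: integrate exp(t*z) against the N(0,1) pdf.
mgf_z_num <- function(t) {
  integrate(function(z) exp(t * z) * dnorm(z), lower = -Inf, upper = Inf)$value
}

t_vals <- c(-1, 0, 0.5, 2)              # illustrative values of t
num    <- sapply(t_vals, mgf_z_num)     # numerical integrals
exact  <- exp(t_vals^2 / 2)             # expression (3.4.4)
rel_err <- abs(num / exact - 1)         # relative discrepancies
```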


The first two derivatives of $M_Z(t)$ are easily shown to be

$$M_Z'(t) = t\exp\left\{\frac{1}{2}t^2\right\}, \qquad M_Z''(t) = \exp\left\{\frac{1}{2}t^2\right\} + t^2 \exp\left\{\frac{1}{2}t^2\right\}.$$

Upon evaluating these derivatives at $t = 0$, the mean and variance of $Z$ are

$$E(Z) = 0 \text{ and } \text{Var}(Z) = 1. \tag{3.4.5}$$

Next, define the continuous random variable $X$ by

$$X = bZ + a,$$

for $b > 0$. This is a one-to-one transformation. To derive the pdf of $X$, note that the inverse of the transformation and the Jacobian are $z = b^{-1}(x - a)$ and $J = b^{-1}$, respectively. Because $b > 0$, it follows from (3.4.2) that the pdf of $X$ is

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,b} \exp\left\{-\frac{1}{2}\left(\frac{x - a}{b}\right)^2\right\}, \quad -\infty < x < \infty.$$


By (3.4.5), we immediately have $E(X) = a$ and $\text{Var}(X) = b^2$. Hence, in the expression for the pdf of $X$, we can replace $a$ by $\mu = E(X)$ and $b^2$ by $\sigma^2 = \text{Var}(X)$. We make this formal in the following:


Definition 3.4.1 (Normal Distribution). We say a random variable $X$ has a normal distribution if its pdf is

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right\}, \quad \text{for } -\infty < x < \infty. \tag{3.4.6}$$

The parameters $\mu$ and $\sigma^2$ are the mean and variance of $X$, respectively. We often write that $X$ has a $N(\mu, \sigma^2)$ distribution.


In this notation, the random variable $Z$ with pdf (3.4.2) has a $N(0, 1)$ distribution. We call $Z$ a standard normal random variable.

For the mgf of $X$, use the relationship $X = \sigma Z + \mu$ and the mgf for $Z$, (3.4.4), to obtain

$$\begin{aligned} E[\exp\{tX\}] &= E[\exp\{t(\sigma Z + \mu)\}] = \exp\{\mu t\}\, E[\exp\{t\sigma Z\}] \\ &= \exp\{\mu t\} \exp\left\{\frac{1}{2}\sigma^2 t^2\right\} = \exp\left\{\mu t + \frac{1}{2}\sigma^2 t^2\right\}, \end{aligned} \tag{3.4.7}$$

for $-\infty < t < \infty$.
We summarize the above discussion by noting the relationship between $Z$ and $X$:

$X$ has a $N(\mu, \sigma^2)$ distribution if and only if $Z = \dfrac{X - \mu}{\sigma}$ has a $N(0, 1)$ distribution. (3.4.8)
Let $X$ have a $N(\mu, \sigma^2)$ distribution. The graph of the pdf of $X$ is seen in Figure 3.4.1 to have the following characteristics: (1) symmetry about a vertical axis through $x = \mu$; (2) having its maximum of $1/(\sigma\sqrt{2\pi})$ at $x = \mu$; and (3) having the $x$-axis as a horizontal asymptote. It should also be verified that (4) there are points of inflection at $x = \mu \pm \sigma$; see Exercise 3.4.7. By the symmetry about $\mu$, it follows that the median of a normal distribution is equal to its mean.

Figure 3.4.1: The normal density $f(x)$, (3.4.6). [Horizontal axis $x$ marked at $\mu - 3\sigma$, $\mu - 2\sigma$, $\mu - \sigma$, $\mu$, $\mu + \sigma$, $\mu + 2\sigma$, $\mu + 3\sigma$; peak height $1/(\sqrt{2\pi}\,\sigma)$.]
If we want to determine $P(X \le x)$, then the following integration is required:

$$P(X \le x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(t - \mu)^2/(2\sigma^2)}\, dt.$$

From calculus we know that the integrand does not have an antiderivative expressible in closed form in terms of elementary functions; hence, the integration must be carried out by numerical integration procedures. The R software uses such a procedure for its function pnorm. If $X$ has a $N(\mu, \sigma^2)$ distribution, then the R call pnorm(x, μ, σ) computes $P(X \le x)$, while q = qnorm(p, μ, σ) gives the $p$th quantile of $X$; i.e., $q$ solves the equation $P(X \le q) = p$. We illustrate this computation in the next example.


Example 3.4.1. Suppose the height in inches of an adult male is normally distributed with mean $\mu = 70$ inches and standard deviation $\sigma = 4$ inches. As a graph of the pdf of $X$ use Figure 3.4.1, replacing $\mu$ by 70 and $\sigma$ by 4. Suppose we want to compute the probability that a man exceeds six feet (72 inches) in height. Locate 72 on the figure. The desired probability is the area under the curve over the interval $(72, \infty)$, which is computed in R by 1-pnorm(72,70,4) = 0.3085; hence, 31% of males exceed six feet in height. The 95th percentile in height is qnorm(0.95,70,4) = 76.6 inches. What percentage of males have heights within one standard deviation of the mean? Answer: pnorm(74,70,4) - pnorm(66,70,4) = 0.6827.


Before the age of modern computing, tables of probabilities for normal distributions were formulated. Due to the fact (3.4.8), only tables for the standard normal distribution are required. Let $Z$ have the standard normal distribution. A graph of its pdf is displayed in Figure 3.4.2. Common notation for the cdf of $Z$ is

$$\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt, \quad -\infty < z < \infty. \tag{3.4.9}$$

Table II of Appendix D displays a table for $\Phi(z)$ for specified values of $z > 0$. To compute $\Phi(-z)$, where $z > 0$, use the identity

$$\Phi(-z) = 1 - \Phi(z). \tag{3.4.10}$$

This identity follows because the pdf of $Z$ is symmetric about 0. It is apparent in Figure 3.4.2 and the reader is asked to show it in Exercise 3.4.1.


Figure 3.4.2: The standard normal density: $p = \Phi(z_p)$ is the area under the curve to the left of $z_p$. [Curve $\phi(z)$ with the point $z_p$ and the origin $(0, 0)$ marked on the horizontal axis.]


As an illustration of the use of Table II, suppose in Example 3.4.1 that we want to determine the probability that the height of an adult male is between 67 and 71 inches. This is calculated as

$$\begin{aligned} P(67 < X < 71) &= P(X < 71) - P(X < 67) \\ &= P\left(\frac{X - 70}{4} < \frac{71 - 70}{4}\right) - P\left(\frac{X - 70}{4} < \frac{67 - 70}{4}\right) \\ &= P(Z < 0.25) - P(Z < -0.75) = \Phi(0.25) - 1 + \Phi(0.75) \\ &= 0.5987 - 1 + 0.7734 = 0.3721 \qquad (3.4.11) \\ &= \texttt{pnorm(71,70,4)} - \texttt{pnorm(67,70,4)} = 0.372079. \qquad (3.4.12) \end{aligned}$$


Expression (3.4.11) is the calculation by using Table II, while the last line is the calculation by using the R function pnorm. More examples are offered in the exercises. As a final note on Table II, it is generated by the R function:

normtab <- function(){ za <- seq(0.00,3.59,.01);
  pz <- t(matrix(round(pnorm(za),digits=4),nrow=10))
  colnames(pz) <- seq(0,.09,.01)
  rownames(pz) <- seq(0.0,3.5,.1); return(pz)}

The function normtab can be downloaded at the site mentioned in the Preface.


Example 3.4.2 (Empirical Rule). Let $X$ be $N(\mu, \sigma^2)$. Then, by Table II or R,

$$\begin{aligned} P(\mu - 2\sigma < X < \mu + 2\sigma) &= \Phi\left(\frac{\mu + 2\sigma - \mu}{\sigma}\right) - \Phi\left(\frac{\mu - 2\sigma - \mu}{\sigma}\right) \\ &= \Phi(2) - \Phi(-2) \\ &= 0.977 - (1 - 0.977) = 0.954. \end{aligned}$$

Similarly, $P(\mu - \sigma < X < \mu + \sigma) = 0.6827$ and $P(\mu - 3\sigma < X < \mu + 3\sigma) = 0.9973$. Sometimes these three intervals and their corresponding probabilities are referred to as the empirical rule. Note that we can use Chebyshev’s Theorem (Theorem 1.10.3) to obtain lower bounds for these probabilities. While the empirical rule is much more precise, it also requires the assumption of a normal distribution. On the other hand, Chebyshev’s theorem requires only the assumption of a finite variance. ■
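The comparison between the empirical rule and Chebyshev's lower bounds can be tabulated in a few lines of R. This is a small illustrative sketch; it uses the standard form of Chebyshev's inequality, $P(|X - \mu| < k\sigma) \ge 1 - 1/k^2$, and the variable names are arbitrary.

```r
k <- 1:3                                # number of standard deviations

# Exact normal probabilities P(mu - k*sigma < X < mu + k*sigma);
# by (3.4.8) they do not depend on mu or sigma.
normal_prob <- pnorm(k) - pnorm(-k)

# Chebyshev lower bounds, valid for any distribution with finite variance.
chebyshev_lb <- 1 - 1 / k^2

tab <- round(cbind(k, normal_prob, chebyshev_lb), 4)
print(tab)
```

For $k = 1$ the Chebyshev bound is 0, so it carries no information, while the normal probability is about 0.68.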


Example 3.4.3. Suppose that 10% of the probability for a certain distribution that is $N(\mu, \sigma^2)$ is below 60 and that 5% is above 90. What are the values of $\mu$ and $\sigma$? We are given that the random variable $X$ is $N(\mu, \sigma^2)$ and that $P(X \le 60) = 0.10$ and $P(X \le 90) = 0.95$. Thus $\Phi[(60 - \mu)/\sigma] = 0.10$ and $\Phi[(90 - \mu)/\sigma] = 0.95$. From Table II we have

$$\frac{60 - \mu}{\sigma} = -1.28, \qquad \frac{90 - \mu}{\sigma} = 1.64.$$

These conditions require that $\mu = 73.1$ and $\sigma = 10.2$ approximately. ■
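The same two equations can be solved in R with qnorm in place of Table II. This is a sketch rather than the text's own code; because qnorm returns more digits than the table, the results differ slightly from the rounded values quoted above.

```r
z_lo <- qnorm(0.10)                     # standard normal 10th percentile
z_hi <- qnorm(0.95)                     # standard normal 95th percentile

# Solve (60 - mu)/sigma = z_lo and (90 - mu)/sigma = z_hi for mu and sigma.
sigma <- (90 - 60) / (z_hi - z_lo)
mu    <- 60 - sigma * z_lo

# Check that the fitted distribution reproduces the two stated probabilities.
p_below_60 <- pnorm(60, mu, sigma)
p_above_90 <- 1 - pnorm(90, mu, sigma)
```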


Remark 3.4.1. In this chapter we have illustrated three types of parameters associated with distributions. The mean $\mu$ of $N(\mu, \sigma^2)$ is called a location parameter because changing its value simply changes the location of the middle of the normal pdf; that is, the graph of the pdf looks exactly the same except for a shift in location. The standard deviation $\sigma$ of $N(\mu, \sigma^2)$ is called a scale parameter because changing its value changes the spread of the distribution. That is, a small value of $\sigma$ requires the graph of the normal pdf to be tall and narrow, while a large value of $\sigma$ requires it to spread out and not be so tall. No matter what the values of $\mu$ and $\sigma$, however, the graph of the normal pdf is that familiar “bell shape.” Incidentally, the $\beta$ of the gamma distribution is also a scale parameter. On the other hand, the $\alpha$ of the gamma distribution is called a shape parameter, as changing its value modifies the shape of the graph of the pdf, as can be seen by referring to Figure 3.3.1. The parameters $p$ and $\mu$ of the binomial and Poisson distributions, respectively, are also shape parameters. ■


Continuing with the first part of Remark 3.4.1, if $X$ is $N(\mu, \sigma^2)$ then we say that $X$ follows the location model, which we write as

$$X = \mu + e, \tag{3.4.13}$$

where $e$ is a random variable (often called random error) with a $N(0, \sigma^2)$ distribution. Conversely, it follows immediately that if $X$ satisfies expression (3.4.13) with $e$ distributed $N(0, \sigma^2)$ then $X$ has a $N(\mu, \sigma^2)$ distribution.


We close this part of the section with three important results. 


Example 3.4.4 (All the Moments of a Normal Distribution). Recall that in Example 1.9.7, we derived all the moments of a standard normal random variable by using its moment generating function. We can use this to obtain all the moments of $X$, where $X$ has a $N(\mu, \sigma^2)$ distribution. From expression (3.4.13), we can write $X = \sigma Z + \mu$, where $Z$ has a $N(0, 1)$ distribution. Hence, for all nonnegative integers $k$ a simple application of the binomial theorem yields

$$E(X^k) = E[(\sigma Z + \mu)^k] = \sum_{j=0}^{k} \binom{k}{j} \sigma^j E(Z^j) \mu^{k-j}. \tag{3.4.14}$$

Recall from Example 1.9.7 that all the odd moments of $Z$ are 0, while all the even moments are given by expression (1.9.3). These can be substituted into expression (3.4.14) to derive the moments of $X$. ■
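Expression (3.4.14) translates directly into a short R function. The sketch below assumes that the even moments of $Z$ are $E(Z^{2m}) = (2m)!/(2^m m!)$, the standard formula that Exercise 3.4.15 also uses; the function names z_moment and normal_moment are hypothetical.

```r
# j-th moment of Z ~ N(0,1): zero for odd j, j!/(2^(j/2) (j/2)!) for even j.
z_moment <- function(j) {
  if (j %% 2 == 1) return(0)
  factorial(j) / (2^(j / 2) * factorial(j / 2))
}

# k-th moment of X ~ N(mu, sigma^2) via the binomial expansion (3.4.14).
normal_moment <- function(k, mu, sigma) {
  j <- 0:k
  sum(choose(k, j) * sigma^j * sapply(j, z_moment) * mu^(k - j))
}

mu <- 3; sigma <- 2                     # illustrative values
ex1 <- normal_moment(1, mu, sigma)      # mu
ex2 <- normal_moment(2, mu, sigma)      # mu^2 + sigma^2
ex3 <- normal_moment(3, mu, sigma)      # mu^3 + 3*mu*sigma^2
ez4 <- normal_moment(4, 0, 1)           # fourth moment of Z
```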


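Theorem 3.4.1. If the random variable $X$ is $N(\mu, \sigma^2)$, $\sigma^2 > 0$, then the random variable $V = (X - \mu)^2/\sigma^2$ is $\chi^2(1)$.


Proof. Because $V = W^2$, where $W = (X - \mu)/\sigma$ is $N(0, 1)$, the cdf $G(v)$ for $V$ is, for $v \ge 0$,

$$G(v) = P(W^2 \le v) = P(-\sqrt{v} \le W \le \sqrt{v}).$$

That is,

$$G(v) = 2\int_0^{\sqrt{v}} \frac{1}{\sqrt{2\pi}}\, e^{-w^2/2}\, dw, \quad 0 \le v,$$

and

$$G(v) = 0, \quad v < 0.$$

If we change the variable of integration by writing $w = \sqrt{y}$, then

$$G(v) = \int_0^{v} \frac{1}{\sqrt{2\pi}\sqrt{y}}\, e^{-y/2}\, dy, \quad 0 \le v.$$

Hence the pdf $g(v) = G'(v)$ of the continuous-type random variable $V$ is

$$g(v) = \begin{cases} \dfrac{1}{\sqrt{\pi}\sqrt{2}}\, v^{\frac{1}{2} - 1} e^{-v/2} & 0 < v < \infty \\ 0 & \text{elsewhere.} \end{cases}$$

Since $g(v)$ is a pdf,

$$\int_0^\infty g(v)\, dv = 1;$$

hence, it must be that $\Gamma\left(\frac{1}{2}\right) = \sqrt{\pi}$ and thus $V$ is $\chi^2(1)$. ■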
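Both conclusions of the proof are easy to confirm numerically in R: the cdf of $V$ written in terms of pnorm should agree with pchisq with one degree of freedom, and gamma(0.5) should equal $\sqrt{\pi}$. The evaluation points below are arbitrary; this is a sketch, not part of the proof.

```r
v <- c(0.1, 1, 3.84, 6.63)              # illustrative values of v

# G(v) = P(-sqrt(v) <= W <= sqrt(v)) for W ~ N(0,1).
g_from_normal <- pnorm(sqrt(v)) - pnorm(-sqrt(v))

# cdf of a chi-square distribution with 1 degree of freedom.
g_from_chisq <- pchisq(v, df = 1)

cdf_diff   <- max(abs(g_from_normal - g_from_chisq))   # should be near zero
gamma_half <- gamma(0.5)                                # should equal sqrt(pi)
```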


One of the most important properties of the normal distribution is its additivity under independence.

Theorem 3.4.2. Let $X_1, \ldots, X_n$ be independent random variables such that, for $i = 1, \ldots, n$, $X_i$ has a $N(\mu_i, \sigma_i^2)$ distribution. Let $Y = \sum_{i=1}^n a_i X_i$, where $a_1, \ldots, a_n$ are constants. Then the distribution of $Y$ is $N\left(\sum_{i=1}^n a_i\mu_i, \sum_{i=1}^n a_i^2\sigma_i^2\right)$.

Proof: By Theorem 2.6.1, for $t \in R$, the mgf of $Y$ is

$$M_Y(t) = \prod_{i=1}^n \exp\left\{t a_i \mu_i + (1/2) t^2 a_i^2 \sigma_i^2\right\} = \exp\left\{t \sum_{i=1}^n a_i\mu_i + (1/2) t^2 \sum_{i=1}^n a_i^2\sigma_i^2\right\},$$

which is the mgf of a $N\left(\sum_{i=1}^n a_i\mu_i, \sum_{i=1}^n a_i^2\sigma_i^2\right)$ distribution. ■


A simple corollary to this result gives the distribution of the sample mean $\bar{X} = n^{-1}\sum_{i=1}^n X_i$ when $X_1, X_2, \ldots, X_n$ represents a random sample from a $N(\mu, \sigma^2)$ distribution.


Corollary 3.4.1. Let $X_1, \ldots, X_n$ be iid random variables with a common $N(\mu, \sigma^2)$ distribution. Let $\bar{X} = n^{-1}\sum_{i=1}^n X_i$. Then $\bar{X}$ has a $N(\mu, \sigma^2/n)$ distribution.


To prove this corollary, simply take $a_i = (1/n)$, $\mu_i = \mu$, and $\sigma_i^2 = \sigma^2$, for $i = 1, 2, \ldots, n$, in Theorem 3.4.2.
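A small simulation gives an informal feel for the corollary. The sketch below is an illustration only: it reuses $\mu = 70$ and $\sigma = 4$ from Example 3.4.1, while the sample size $n = 16$, the number of replications, and the seed are arbitrary assumptions.

```r
set.seed(2)                             # arbitrary seed, for reproducibility
mu <- 70; sigma <- 4; n <- 16           # n is an illustrative sample size

# 10000 simulated sample means, each from a sample of size n.
xbar <- replicate(10000, mean(rnorm(n, mu, sigma)))

sim_mean <- mean(xbar)                  # should be near mu
sim_var  <- var(xbar)                   # should be near sigma^2 / n
theory_var <- sigma^2 / n

# Empirical versus theoretical P(Xbar <= 71) under N(mu, sigma^2/n).
p_emp    <- mean(xbar <= 71)
p_theory <- pnorm(71, mu, sigma / sqrt(n))
```

Note that pnorm takes the standard deviation, so the third argument is $\sigma/\sqrt{n}$ rather than the variance $\sigma^2/n$.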


3.4.1 *Contaminated Normals 


We next discuss a random variable whose distribution is a mixture of normals. As 
with the normal, we begin with a standardized random variable. 

Suppose we are observing a random variable that most of the time follows a standard normal distribution but occasionally follows a normal distribution with a larger variance. In applications, we might say that most of the data are “good” but that there are occasional outliers. To make this precise let $Z$ have a $N(0, 1)$ distribution; let $I_{1-\epsilon}$ be a discrete random variable defined by

$$I_{1-\epsilon} = \begin{cases} 1 & \text{with probability } 1 - \epsilon \\ 0 & \text{with probability } \epsilon, \end{cases}$$

and assume that $Z$ and $I_{1-\epsilon}$ are independent. Let $W = Z I_{1-\epsilon} + \sigma_c Z(1 - I_{1-\epsilon})$. Then $W$ is the random variable of interest.
The independence of $Z$ and $I_{1-\epsilon}$ implies that the cdf of $W$ is

$$\begin{aligned} F_W(w) = P[W \le w] &= P[W \le w, I_{1-\epsilon} = 1] + P[W \le w, I_{1-\epsilon} = 0] \\ &= P[W \le w \mid I_{1-\epsilon} = 1]\, P[I_{1-\epsilon} = 1] + P[W \le w \mid I_{1-\epsilon} = 0]\, P[I_{1-\epsilon} = 0] \\ &= P[Z \le w](1 - \epsilon) + P[Z \le w/\sigma_c]\,\epsilon \\ &= \Phi(w)(1 - \epsilon) + \Phi(w/\sigma_c)\,\epsilon. \end{aligned} \tag{3.4.15}$$
Therefore, we have shown that the distribution of W is a mixture of normals. 
Further, because $W = Z I_{1-\epsilon} + \sigma_c Z(1 - I_{1-\epsilon})$, we have

$$E(W) = 0 \text{ and } \text{Var}(W) = 1 + \epsilon(\sigma_c^2 - 1); \tag{3.4.16}$$

see Exercise 3.4.24. Upon differentiating (3.4.15), the pdf of $W$ is

$$f_W(w) = \phi(w)(1 - \epsilon) + \phi(w/\sigma_c)\frac{\epsilon}{\sigma_c}, \tag{3.4.17}$$

where $\phi$ is the pdf of a standard normal.
Suppose, in general, that the random variable of interest is $X = a + bW$, where $b > 0$. Based on (3.4.16), the mean and variance of $X$ are

$$E(X) = a \text{ and } \text{Var}(X) = b^2(1 + \epsilon(\sigma_c^2 - 1)). \tag{3.4.18}$$

From expression (3.4.15), the cdf of $X$ is

$$F_X(x) = \Phi\left(\frac{x - a}{b}\right)(1 - \epsilon) + \Phi\left(\frac{x - a}{b\sigma_c}\right)\epsilon, \tag{3.4.19}$$

which is a mixture of normal cdfs.

Based on expression (3.4.19) it is easy to obtain probabilities for contaminated normal distributions using R. For example, suppose, as above, $W$ has cdf (3.4.15). Then $P(W \le w)$ is obtained by the R command (1-eps)*pnorm(w) + eps*pnorm(w/sigc), where eps and sigc denote $\epsilon$ and $\sigma_c$, respectively. Similarly, the pdf of $W$ at $w$ is returned by (1-eps)*dnorm(w) + eps*dnorm(w/sigc)/sigc. The functions pcn and dcn⁷ compute the cdf and pdf of the contaminated normal, respectively. In Section 3.7, we explore mixture distributions in general.
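The two R expressions above can be wrapped into small functions. The following sketch does not reproduce the downloadable pcn and dcn, whose exact arguments are not shown in the text; the names pcn_sketch, dcn_sketch, and pcn_loc_scale, and their argument order, are assumptions made here for illustration.

```r
# cdf (3.4.15) and pdf (3.4.17) of the standardized contaminated normal W.
pcn_sketch <- function(w, eps, sigc) {
  (1 - eps) * pnorm(w) + eps * pnorm(w / sigc)
}
dcn_sketch <- function(w, eps, sigc) {
  (1 - eps) * dnorm(w) + eps * dnorm(w / sigc) / sigc
}

# cdf (3.4.19) of X = a + b*W, with b > 0.
pcn_loc_scale <- function(x, a, b, eps, sigc) {
  pcn_sketch((x - a) / b, eps, sigc)
}

eps <- 0.15; sigc <- 10                 # illustrative contamination settings

# The pdf should integrate to 1.
total_prob <- integrate(dcn_sketch, -Inf, Inf, eps = eps, sigc = sigc)$value

# The pdf should match a numerical derivative of the cdf.
w0 <- 1.3; h <- 1e-5
num_deriv <- (pcn_sketch(w0 + h, eps, sigc) - pcn_sketch(w0 - h, eps, sigc)) / (2 * h)

# Tail probability P(|W| > 2), which exceeds the N(0,1) value.
p_out_cn     <- 2 * (1 - pcn_sketch(2, eps, sigc))
p_out_normal <- 2 * (1 - pnorm(2))
```

Setting eps = 0 recovers pnorm and dnorm, which is a quick sanity check on any implementation of these formulas.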


EXERCISES 
3.4.1. If

$$\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-w^2/2}\, dw,$$

show that $\Phi(-z) = 1 - \Phi(z)$.


3.4.2. If $X$ is $N(75, 100)$, find $P(X < 60)$ and $P(70 < X < 100)$ by using either Table II or the R command pnorm.


3.4.3. If $X$ is $N(\mu, \sigma^2)$, find $b$ so that $P[-b < (X - \mu)/\sigma < b] = 0.90$, by using either Table II of Appendix D or the R command qnorm.


3.4.4. Let $X$ be $N(\mu, \sigma^2)$ so that $P(X < 89) = 0.90$ and $P(X < 94) = 0.95$. Find $\mu$ and $\sigma^2$.


3.4.5. Show that the constant $c$ can be selected so that $f(x) = c\,2^{-x^2}$, $-\infty < x < \infty$, satisfies the conditions of a normal pdf.
Hint: Write $2 = e^{\log 2}$.


3.4.6. If $X$ is $N(\mu, \sigma^2)$, show that $E(|X - \mu|) = \sigma\sqrt{2/\pi}$.


3.4.7. Show that the graph of a pdf $N(\mu, \sigma^2)$ has points of inflection at $x = \mu - \sigma$ and $x = \mu + \sigma$.


⁷ Downloadable at the site listed in the Preface.

3.4.8. Evaluate $\int_2^3 \exp[-2(x - 3)^2]\, dx$.
3.4.9. Determine the 90th percentile of the distribution, which is $N(65, 25)$.
3.4.10. If $e^{3t + 8t^2}$ is the mgf of the random variable $X$, find $P(-1 < X < 9)$.


3.4.11. Let the random variable $X$ have the pdf

$$f(x) = \frac{2}{\sqrt{2\pi}}\, e^{-x^2/2}, \quad 0 < x < \infty, \quad \text{zero elsewhere.}$$

(a) Find the mean and the variance of $X$.
(b) Find the cdf and hazard function of $X$.


Hint for (a): Compute $E(X)$ directly and $E(X^2)$ by comparing the integral with the integral representing the variance of a random variable that is $N(0, 1)$.


3.4.12. Let $X$ be $N(5, 10)$. Find $P[0.04 < (X - 5)^2 < 38.4]$.
3.4.13. If $X$ is $N(1, 4)$, compute the probability $P(1 < X^2 < 9)$.


3.4.14. If X is N(75, 25), find the conditional probability that X is greater than 
80 given that X is greater than 77. See Exercise 2.3.12. 


3.4.15. Let $X$ be a random variable such that $E(X^{2m}) = (2m)!/(2^m m!)$, $m = 1, 2, 3, \ldots$ and $E(X^{2m-1}) = 0$, $m = 1, 2, 3, \ldots$. Find the mgf and the pdf of $X$.


3.4.16. Let the mutually independent random variables $X_1$, $X_2$, and $X_3$ be $N(0, 1)$, $N(2, 4)$, and $N(-1, 1)$, respectively. Compute the probability that exactly two of these three variables are less than zero.


3.4.17. Compute the measures of skewness and kurtosis of a distribution which is $N(\mu, \sigma^2)$. See Exercises 1.9.14 and 1.9.15 for the definitions of skewness and kurtosis, respectively.


3.4.18. Let the random variable $X$ have a distribution that is $N(\mu, \sigma^2)$.
(a) Does the random variable $Y = X^2$ also have a normal distribution?


(b) Would the random variable $Y = aX + b$, $a$ and $b$ nonzero constants, have a normal distribution?
Hint: In each case, first determine $P(Y \le y)$.


3.4.19. Let the random variable $X$ be $N(\mu, \sigma^2)$. What would this distribution be if $\sigma^2 = 0$?
Hint: Look at the mgf of $X$ for $\sigma^2 > 0$ and investigate its limit as $\sigma^2 \to 0$.


3.4.20. Let $Y$ have a truncated distribution with pdf $g(y) = \phi(y)/[\Phi(b) - \Phi(a)]$, for $a < y < b$, zero elsewhere, where $\phi(x)$ and $\Phi(x)$ are, respectively, the pdf and distribution function of a standard normal distribution. Show then that $E(Y)$ is equal to $[\phi(a) - \phi(b)]/[\Phi(b) - \Phi(a)]$.

3.4.21. Let $f(x)$ and $F(x)$ be the pdf and the cdf, respectively, of a distribution of the continuous type such that $f'(x)$ exists for all $x$. Let the mean of the truncated distribution that has pdf $g(y) = f(y)/F(b)$, $-\infty < y < b$, zero elsewhere, be equal to $-f(b)/F(b)$ for all real $b$. Prove that $f(x)$ is a pdf of a standard normal distribution.


3.4.22. Let $X$ and $Y$ be independent random variables, each with a distribution that is $N(0, 1)$. Let $Z = X + Y$. Find the integral that represents the cdf $G(z) = P(X + Y \le z)$ of $Z$. Determine the pdf of $Z$.

Hint: We have that $G(z) = \int_{-\infty}^{\infty} H(x, z)\, dx$, where

$$H(x, z) = \int_{-\infty}^{z - x} \frac{1}{2\pi} \exp[-(x^2 + y^2)/2]\, dy.$$

Find $G'(z)$ by evaluating $\int_{-\infty}^{\infty} [\partial H(x, z)/\partial z]\, dx$.


3.4.23. Suppose $X$ is a random variable with the pdf $f(x)$ which is symmetric about 0; i.e., $f(-x) = f(x)$. Show that $F(-x) = 1 - F(x)$, for all $x$ in the support of $X$.


3.4.24. Derive the mean and variance of a contaminated normal random variable. 
They are given in expression (3.4.16). 


3.4.25. Investigate the probabilities of an “outlier” for a contaminated normal random variable and a normal random variable. Specifically, determine the probability of observing the event $\{|X| \ge 2\}$ for the following random variables (use the R function pcn for the contaminated normals):


(a) $X$ has a standard normal distribution.


(b) $X$ has a contaminated normal distribution with cdf (3.4.15), where $\epsilon = 0.15$ and $\sigma_c = 10$.


(c) $X$ has a contaminated normal distribution with cdf (3.4.15), where $\epsilon = 0.15$ and $\sigma_c = 20$.


(d) $X$ has a contaminated normal distribution with cdf (3.4.15), where $\epsilon = 0.25$ and $\sigma_c = 20$.


3.4.26. Plot the pdfs of the random variables defined in parts (a)-(d) of the last exercise. Obtain an overlay plot of all four pdfs also. In R the domain values of the pdfs can easily be obtained by using the seq command. For instance, the command x<-seq(-6,6,.1) returns a vector of values between $-6$ and 6 in jumps of 0.1. Then use the R function dcn for the contaminated normal pdfs.


3.4.27. Consider the family of pdfs indexed by the parameter $\alpha$, $-\infty < \alpha < \infty$,
given by

$$f(x; \alpha) = 2\phi(x)\Phi(\alpha x), \quad -\infty < x < \infty, \tag{3.4.20}$$

where $\phi(x)$ and $\Phi(x)$ are respectively the pdf and cdf of a standard normal
distribution.
(a) Clearly $f(x; \alpha) > 0$ for all x. Show that the pdf integrates to 1 over $(-\infty, \infty)$.
Hint: Start with

$$\int_{-\infty}^{\infty} f(x; \alpha)\,dx = 2\int_{-\infty}^{\infty} \phi(x) \int_{-\infty}^{\alpha x} \phi(t)\,dt\,dx.$$

Next sketch the region of integration and then combine the integrands and
use the polar coordinate transformation we used after expression (3.4.1).


(b) Note that $f(x; \alpha)$ is the N(0,1) pdf for $\alpha = 0$. The pdfs are left skewed for
$\alpha < 0$ and right skewed for $\alpha > 0$. Using R, verify this by plotting the pdfs
for $\alpha = -3, -2, -1, 1, 2, 3$. Here’s the code for $\alpha = -3$:

x=seq(-5,5,.01); alp=-3; y=2*dnorm(x)*pnorm(alp*x); plot(y~x)

This family is called the skewed normal family; see Azzalini (1985).

3.4.28. For Z distributed N(0, 1), it can be shown that

$$E[\Phi(hZ + k)] = \Phi\left[k/\sqrt{1 + h^2}\right];$$

see Azzalini (1985). Use this fact to obtain the mgf of the pdf (3.4.20). Next obtain
the mean of this pdf.

3.4.29. Let $X_1$ and $X_2$ be independent with normal distributions N(6,1) and
N(7,1), respectively. Find $P(X_1 > X_2)$.
Hint: Write $P(X_1 > X_2) = P(X_1 - X_2 > 0)$ and determine the distribution of
$X_1 - X_2$.

3.4.30. Compute $P(X_1 + 2X_2 - 2X_3 > 7)$ if $X_1, X_2, X_3$ are iid with common
distribution N(1, 4).


3.4.31. A certain job is completed in three steps in series. The means and standard 
deviations for the steps are (in minutes) 


Step Mean Standard Deviation 


1 17 2 
2 13 1 
3 13 2 


Assuming independent steps and normal distributions, compute the probability that 
the job takes less than 40 minutes to complete. 


3.4.32. Let X be N(0,1). Use the moment generating function technique to show
that $Y = X^2$ is $\chi^2(1)$.

Hint: Evaluate the integral that represents $E(e^{tX^2})$ by writing $w = x\sqrt{1 - 2t}$,
$t < \frac{1}{2}$.

3.4.33. Suppose $X_1, X_2$ are iid with a common standard normal distribution. Find
the joint pdf of $Y_1 = X_1^2 + X_2^2$ and $Y_2 = X_2$ and the marginal pdf of $Y_1$.

Hint: Note that the space of $Y_1$ and $Y_2$ is given by $-\sqrt{y_1} < y_2 < \sqrt{y_1}$, $0 < y_1 < \infty$.
3.5 The Multivariate Normal Distribution


In this section we present the multivariate normal distribution. In the first part 
of the section, we introduce the bivariate normal distribution, leaving most of the 
proofs to the later section, Section 3.5.2. 


3.5.1 Bivariate Normal Distribution 
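We say that (X,Y) follows a bivariate normal distribution if its pdf is given by

$$f(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\, e^{-q/2}, \quad -\infty < x < \infty,\ -\infty < y < \infty, \tag{3.5.1}$$

where

$$q = \frac{1}{1-\rho^2}\left[\left(\frac{x-\mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x-\mu_1}{\sigma_1}\right)\left(\frac{y-\mu_2}{\sigma_2}\right) + \left(\frac{y-\mu_2}{\sigma_2}\right)^2\right], \tag{3.5.2}$$

and $-\infty < \mu_i < \infty$, $\sigma_i > 0$, for i = 1, 2, and $\rho$ satisfies $\rho^2 < 1$. Clearly, this function
is positive everywhere in $R^2$. As we show in Section 3.5.2, it is a pdf with the mgf
given by: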


$$M_{(X,Y)}(t_1, t_2) = \exp\left\{t_1\mu_1 + t_2\mu_2 + \frac{1}{2}\left(t_1^2\sigma_1^2 + 2t_1t_2\rho\sigma_1\sigma_2 + t_2^2\sigma_2^2\right)\right\}. \tag{3.5.3}$$

Thus, the mgf of X is

$$M_X(t_1) = M_{(X,Y)}(t_1, 0) = \exp\left\{t_1\mu_1 + \frac{1}{2}t_1^2\sigma_1^2\right\};$$

hence, X has a $N(\mu_1, \sigma_1^2)$ distribution. In the same way, Y has a $N(\mu_2, \sigma_2^2)$ distribution.
Thus $\mu_1$ and $\mu_2$ are the respective means of X and Y and $\sigma_1^2$ and $\sigma_2^2$ are the
respective variances of X and Y. For the parameter $\rho$, Exercise 3.5.3 shows that

$$E(XY) = \frac{\partial^2 M_{(X,Y)}}{\partial t_1\,\partial t_2}(0, 0) = \rho\sigma_1\sigma_2 + \mu_1\mu_2. \tag{3.5.4}$$

Hence, $\text{cov}(X,Y) = \rho\sigma_1\sigma_2$ and thus, as the notation suggests, $\rho$ is the correlation
coefficient between X and Y. We know by Theorem 2.5.2 that if X and Y are
independent then $\rho = 0$. Further, from expression (3.5.3), if $\rho = 0$ then the joint
mgf of (X,Y) factors into the product of the marginal mgfs and, hence, X and Y are
independent random variables. Thus if (X,Y) has a bivariate normal distribution,
then X and Y are independent if and only if they are uncorrelated.

The bivariate normal pdf, (3.5.1), is mound shaped over $R^2$ and peaks at its
mean $(\mu_1, \mu_2)$; see Exercise 3.5.4. For a given c > 0, the points of equal probability
(or density) are given by $\{(x, y) : f(x, y) = c\}$. It follows with some algebra that
these sets are ellipses. In general for multivariate distributions, we call these sets
contours of the pdfs. Hence, the contours of bivariate normal distributions are
elliptical. If X and Y are independent and $\sigma_1 = \sigma_2$, then these contours are circular.
The interested reader can consult a book on multivariate statistics for discussions on the
geometry of the ellipses. For example, if $\sigma_1 = \sigma_2$ and $\rho > 0$, the main axis of the
ellipse goes through the mean at a 45° angle; see Johnson and Wichern (2008) for
discussion.

Figure 3.5.1 displays a three-dimensional plot of the bivariate normal pdf with
$(\mu_1, \mu_2) = (0,0)$, $\sigma_1 = \sigma_2 = 1$, and $\rho = 0.5$. For location, the peak is at
$(\mu_1, \mu_2) = (0,0)$. The elliptical contours are apparent. Locate the main axis. For a region A
in the plane, $P[(X,Y) \in A]$ is the volume under the surface over A. In general such
probabilities are calculated by numerical integration methods.
Figure 3.5.1: A sketch of the surface of a bivariate normal distribution with mean
(0,0), $\sigma_1 = \sigma_2 = 1$, and $\rho = 0.5$.
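The remark that probabilities of regions are obtained by numerical integration can be illustrated with a minimal base R sketch. The function names `dbvn` and `rect_prob` are ours (they are not part of the text or of hmcpkg), and iterated one-dimensional quadrature with `integrate` is only one of several possible methods; it is adequate for a rectangle but is not a general-purpose routine.

```r
# Bivariate normal pdf (3.5.1)-(3.5.2); dbvn is a hypothetical helper name.
dbvn <- function(x, y, mu1 = 0, mu2 = 0, s1 = 1, s2 = 1, rho = 0) {
  zx <- (x - mu1) / s1
  zy <- (y - mu2) / s2
  q  <- (zx^2 - 2 * rho * zx * zy + zy^2) / (1 - rho^2)
  exp(-q / 2) / (2 * pi * s1 * s2 * sqrt(1 - rho^2))
}

# P(xlo < X < xhi, ylo < Y < yhi) by iterated one-dimensional quadrature.
rect_prob <- function(xlo, xhi, ylo, yhi, ...) {
  inner <- function(x) {
    # For each x, integrate the pdf over y; sapply keeps inner() vectorized.
    sapply(x, function(xi) {
      integrate(function(y) dbvn(xi, y, ...), ylo, yhi)$value
    })
  }
  integrate(inner, xlo, xhi)$value
}

p_half <- rect_prob(-1, 1, -1, 1, rho = 0.5)  # setting of Figure 3.5.1
p_zero <- rect_prob(-1, 1, -1, 1, rho = 0)    # independent case
c(p_half = p_half, p_zero = p_zero)
```

When $\rho = 0$ the answer can be checked against the product $[\Phi(1) - \Phi(-1)]^2$, which gives a quick sanity test of the quadrature.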


In the next section, we extend the discussion to the general multivariate case;
however, Remark 3.5.1, below, returns to the bivariate case and can be read with
only a minor knowledge of vectors and matrices.


3.5.2 *Multivariate Normal Distribution, General Case 


In this section we generalize the bivariate normal distribution to the n-dimensional
multivariate normal distribution. As with Section 3.4 on the normal distribution,
the derivation of the distribution is simplified by first discussing the standardized
variable case and then proceeding to the general case. Also, in this section, vector
and matrix notation are used.
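Consider the random vector $\mathbf{Z} = (Z_1, \ldots, Z_n)'$, where $Z_1, \ldots, Z_n$ are iid N(0, 1)
random variables. Then the density of $\mathbf{Z}$ is

$$f_{\mathbf{Z}}(\mathbf{z}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{1}{2}z_i^2\right\} = \left(\frac{1}{2\pi}\right)^{n/2} \exp\left\{-\frac{1}{2}\sum_{i=1}^{n} z_i^2\right\} = \left(\frac{1}{2\pi}\right)^{n/2} \exp\left\{-\frac{1}{2}\mathbf{z}'\mathbf{z}\right\}, \tag{3.5.5}$$

for $\mathbf{z} \in R^n$. Because the $Z_i$s have mean 0, have variance 1, and are uncorrelated,
the mean and covariance matrix of $\mathbf{Z}$ are

$$E[\mathbf{Z}] = \mathbf{0} \quad \text{and} \quad \text{Cov}[\mathbf{Z}] = \mathbf{I}_n, \tag{3.5.6}$$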


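where $\mathbf{I}_n$ denotes the identity matrix of order n. Recall that the mgf of $Z_i$ evaluated
at $t_i$ is $\exp\{t_i^2/2\}$. Hence, because the $Z_i$s are independent, the mgf of $\mathbf{Z}$ is

$$M_{\mathbf{Z}}(\mathbf{t}) = E[\exp\{\mathbf{t}'\mathbf{Z}\}] = E\left[\prod_{i=1}^{n} \exp\{t_i Z_i\}\right] = \prod_{i=1}^{n} E[\exp\{t_i Z_i\}] = \exp\left\{\frac{1}{2}\sum_{i=1}^{n} t_i^2\right\} = \exp\left\{\frac{1}{2}\mathbf{t}'\mathbf{t}\right\}, \tag{3.5.7}$$

for all $\mathbf{t} \in R^n$. We say that $\mathbf{Z}$ has a multivariate normal distribution with
mean vector $\mathbf{0}$ and covariance matrix $\mathbf{I}_n$. We abbreviate this by saying that $\mathbf{Z}$ has
an $N_n(\mathbf{0}, \mathbf{I}_n)$ distribution.

For the general case, suppose $\boldsymbol{\Sigma}$ is an n × n, symmetric, and positive semi-definite
matrix. Then from linear algebra, we can always decompose $\boldsymbol{\Sigma}$ as

$$\boldsymbol{\Sigma} = \boldsymbol{\Gamma}'\boldsymbol{\Lambda}\boldsymbol{\Gamma}, \tag{3.5.8}$$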


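where $\boldsymbol{\Lambda}$ is the diagonal matrix $\boldsymbol{\Lambda} = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$
are the eigenvalues of $\boldsymbol{\Sigma}$, and the columns of $\boldsymbol{\Gamma}'$, $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n$, are the corresponding
eigenvectors. This decomposition is called the spectral decomposition of $\boldsymbol{\Sigma}$. The
matrix $\boldsymbol{\Gamma}$ is orthogonal, i.e., $\boldsymbol{\Gamma}^{-1} = \boldsymbol{\Gamma}'$, and, hence, $\boldsymbol{\Gamma}\boldsymbol{\Gamma}' = \mathbf{I}$. As Exercise 3.5.19
shows, we can write the spectral decomposition in another way, as

$$\boldsymbol{\Sigma} = \boldsymbol{\Gamma}'\boldsymbol{\Lambda}\boldsymbol{\Gamma} = \sum_{i=1}^{n} \lambda_i \mathbf{v}_i\mathbf{v}_i'. \tag{3.5.9}$$

Because the $\lambda_i$s are nonnegative, we can define the diagonal matrix
$\boldsymbol{\Lambda}^{1/2} = \text{diag}\{\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_n}\}$. Then the orthogonality of $\boldsymbol{\Gamma}$ implies

$$\boldsymbol{\Sigma} = [\boldsymbol{\Gamma}'\boldsymbol{\Lambda}^{1/2}\boldsymbol{\Gamma}][\boldsymbol{\Gamma}'\boldsymbol{\Lambda}^{1/2}\boldsymbol{\Gamma}].$$

We define the matrix product in brackets as the square root of the positive
semi-definite matrix $\boldsymbol{\Sigma}$ and write it as

$$\boldsymbol{\Sigma}^{1/2} = \boldsymbol{\Gamma}'\boldsymbol{\Lambda}^{1/2}\boldsymbol{\Gamma}. \tag{3.5.10}$$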
Note that $\boldsymbol{\Sigma}^{1/2}$ is symmetric and positive semi-definite. Suppose $\boldsymbol{\Sigma}$ is positive
definite; that is, all of its eigenvalues are strictly positive. Based on this, it is then
easy to show that

$$\left(\boldsymbol{\Sigma}^{1/2}\right)^{-1} = \boldsymbol{\Gamma}'\boldsymbol{\Lambda}^{-1/2}\boldsymbol{\Gamma}; \tag{3.5.11}$$

see Exercise 3.5.13. We write the left side of this equation as $\boldsymbol{\Sigma}^{-1/2}$. These matrices
enjoy many additional properties of the law of exponents for numbers; see, for
example, Arnold (1981). Here, though, all we need are the properties given above.
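As a small numerical illustration of (3.5.8) through (3.5.11), the following base R sketch builds the square root of a covariance matrix from its spectral decomposition. The matrix `Sigma` is an arbitrary positive definite example chosen by us. Note that `eigen` returns the eigenvectors as the columns of a matrix, which corresponds to $\boldsymbol{\Gamma}'$ in the notation above.

```r
Sigma <- matrix(c(4, 1,
                  1, 3), nrow = 2, byrow = TRUE)  # illustrative matrix

eig <- eigen(Sigma, symmetric = TRUE)
V   <- eig$vectors          # columns are eigenvectors (Gamma' in the text)
lam <- eig$values           # eigenvalues, in decreasing order

Sigma_half     <- V %*% diag(sqrt(lam))     %*% t(V)   # (3.5.10)
Sigma_half_inv <- V %*% diag(1 / sqrt(lam)) %*% t(V)   # (3.5.11)

# Both differences should be zero up to rounding error.
max(abs(Sigma_half %*% Sigma_half - Sigma))
max(abs(Sigma_half %*% Sigma_half_inv - diag(2)))
```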
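Suppose $\mathbf{Z}$ has a $N_n(\mathbf{0}, \mathbf{I}_n)$ distribution. Let $\boldsymbol{\Sigma}$ be a positive semi-definite,
symmetric matrix and let $\boldsymbol{\mu}$ be an n × 1 vector of constants. Define the random
vector $\mathbf{X}$ by

$$\mathbf{X} = \boldsymbol{\Sigma}^{1/2}\mathbf{Z} + \boldsymbol{\mu}. \tag{3.5.12}$$

By (3.5.6) and Theorem 2.6.3, we immediately have

$$E[\mathbf{X}] = \boldsymbol{\mu} \quad \text{and} \quad \text{Cov}[\mathbf{X}] = \boldsymbol{\Sigma}^{1/2}\boldsymbol{\Sigma}^{1/2} = \boldsymbol{\Sigma}. \tag{3.5.13}$$

Further, the mgf of $\mathbf{X}$ is given by

$$\begin{aligned}
M_{\mathbf{X}}(\mathbf{t}) = E[\exp\{\mathbf{t}'\mathbf{X}\}] &= E\left[\exp\{\mathbf{t}'\boldsymbol{\Sigma}^{1/2}\mathbf{Z} + \mathbf{t}'\boldsymbol{\mu}\}\right] \\
&= \exp\{\mathbf{t}'\boldsymbol{\mu}\}\, E\left[\exp\{(\boldsymbol{\Sigma}^{1/2}\mathbf{t})'\mathbf{Z}\}\right] \\
&= \exp\{\mathbf{t}'\boldsymbol{\mu}\} \exp\{(1/2)(\boldsymbol{\Sigma}^{1/2}\mathbf{t})'\boldsymbol{\Sigma}^{1/2}\mathbf{t}\} \\
&= \exp\{\mathbf{t}'\boldsymbol{\mu}\} \exp\{(1/2)\mathbf{t}'\boldsymbol{\Sigma}\mathbf{t}\}.
\end{aligned} \tag{3.5.14}$$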
This leads to the following definition: 


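Definition 3.5.1 (Multivariate Normal). We say an n-dimensional random vector
$\mathbf{X}$ has a multivariate normal distribution if its mgf is

$$M_{\mathbf{X}}(\mathbf{t}) = \exp\{\mathbf{t}'\boldsymbol{\mu} + (1/2)\mathbf{t}'\boldsymbol{\Sigma}\mathbf{t}\}, \quad \text{for all } \mathbf{t} \in R^n, \tag{3.5.15}$$

where $\boldsymbol{\Sigma}$ is a symmetric, positive semi-definite matrix and $\boldsymbol{\mu} \in R^n$. We abbreviate
this by saying that $\mathbf{X}$ has a $N_n(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ distribution.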


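Note that our definition is for positive semi-definite matrices $\boldsymbol{\Sigma}$. Usually $\boldsymbol{\Sigma}$ is
positive definite, in which case we can further obtain the density of $\mathbf{X}$. If $\boldsymbol{\Sigma}$ is
positive definite, then so is $\boldsymbol{\Sigma}^{1/2}$ and, as discussed above, its inverse is given by
expression (3.5.11). Thus the transformation between $\mathbf{X}$ and $\mathbf{Z}$, (3.5.12), is one-to-one
with the inverse transformation

$$\mathbf{Z} = \boldsymbol{\Sigma}^{-1/2}(\mathbf{X} - \boldsymbol{\mu})$$

and the Jacobian $|\boldsymbol{\Sigma}^{-1/2}| = |\boldsymbol{\Sigma}|^{-1/2}$. Hence, upon simplification, the pdf of $\mathbf{X}$ is
given by

$$f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left\{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right\}, \quad \text{for } \mathbf{x} \in R^n. \tag{3.5.16}$$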
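The density (3.5.16) is straightforward to code. The sketch below is a minimal base R version; the function name `dmvn` is ours, and the parameter values are arbitrary illustrations. It assumes that `Sigma` is positive definite and does no input checking.

```r
# Multivariate normal pdf (3.5.16) at a single point x; dmvn is a hypothetical name.
dmvn <- function(x, mu, Sigma) {
  n <- length(mu)
  d <- x - mu
  q <- drop(t(d) %*% solve(Sigma, d))   # (x - mu)' Sigma^{-1} (x - mu)
  exp(-q / 2) / ((2 * pi)^(n / 2) * sqrt(det(Sigma)))
}

# With a diagonal Sigma the components are independent normals, so the
# joint density should equal the product of the univariate densities.
mu    <- c(1, -2, 0.5)
Sigma <- diag(c(1, 4, 9))
x     <- c(0.3, -1, 2)
joint   <- dmvn(x, mu, Sigma)
product <- prod(dnorm(x, mean = mu, sd = sqrt(diag(Sigma))))
c(joint = joint, product = product)
```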
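In Section 3.5.1, we discussed the contours of the bivariate normal distribution.
We now extend that discussion to the general case, adding probabilities to the
contours. Let $\mathbf{X}$ have a $N_n(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ distribution. In the n-dimensional case, the
contours of constant probability for the pdf of $\mathbf{X}$, (3.5.16), are the ellipsoids

$$(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) = c^2,$$

for c > 0. Define the random variable $Y = (\mathbf{X} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \boldsymbol{\mu})$. Then using
expression (3.5.12), we have

$$Y = \mathbf{Z}'\boldsymbol{\Sigma}^{1/2}\boldsymbol{\Sigma}^{-1}\boldsymbol{\Sigma}^{1/2}\mathbf{Z} = \mathbf{Z}'\mathbf{Z} = \sum_{i=1}^{n} Z_i^2.$$

Since $Z_1, \ldots, Z_n$ are iid N(0,1), Y has a $\chi^2$-distribution with n degrees of freedom.
Denote the cdf of Y by $F_{\chi^2_n}$. Then we have

$$P[(\mathbf{X} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \boldsymbol{\mu}) \le c^2] = P(Y \le c^2) = F_{\chi^2_n}(c^2). \tag{3.5.17}$$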


These probabilities are often used to label the contour plots; see Exercise 3.5.5. For 
reference, we summarize the above proof in the following theorem. Note that this 
theorem is a generalization of the univariate result given in Theorem 3.4.1. 


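Theorem 3.5.1. Suppose $\mathbf{X}$ has a $N_n(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ distribution, where $\boldsymbol{\Sigma}$ is positive definite.
Then the random variable $Y = (\mathbf{X} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \boldsymbol{\mu})$ has a $\chi^2(n)$ distribution.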
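Theorem 3.5.1 can be checked informally by simulation. The following base R sketch generates draws of $\mathbf{X} = \boldsymbol{\Sigma}^{1/2}\mathbf{Z} + \boldsymbol{\mu}$ as in (3.5.12) and compares the fraction of draws falling inside the 95% ellipsoid with 0.95. The parameter values, sample size, and seed are arbitrary choices of ours; the base R function `mahalanobis` returns the quadratic form $(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})$ for each row of its first argument.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
n     <- 3
mu    <- c(1, 0, -1)
Sigma <- matrix(c(2.0, 0.5, 0.3,
                  0.5, 1.0, 0.2,
                  0.3, 0.2, 1.5), nrow = 3, byrow = TRUE)

# Sigma^{1/2} from the spectral decomposition, as in (3.5.10).
eig        <- eigen(Sigma, symmetric = TRUE)
Sigma_half <- eig$vectors %*% diag(sqrt(eig$values)) %*% t(eig$vectors)

# Each row of X is one draw of Sigma^{1/2} Z + mu, as in (3.5.12).
N <- 100000
Z <- matrix(rnorm(N * n), ncol = n)
X <- sweep(Z %*% Sigma_half, 2, mu, "+")

# Quadratic form Y for every draw, compared with the chi-square quantile.
Y <- mahalanobis(X, center = mu, cov = Sigma)
frac_inside <- mean(Y <= qchisq(0.95, df = n))
frac_inside   # should be close to 0.95
```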


The following two theorems are very useful. The first says that a linear trans- 
formation of a multivariate normal random vector has a multivariate normal distri- 
bution. 


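Theorem 3.5.2. Suppose $\mathbf{X}$ has a $N_n(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ distribution. Let $\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b}$, where
$\mathbf{A}$ is an m × n matrix and $\mathbf{b} \in R^m$. Then $\mathbf{Y}$ has a $N_m(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}')$ distribution.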


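Proof: From (3.5.15), for $\mathbf{t} \in R^m$, the mgf of $\mathbf{Y}$ is

$$\begin{aligned}
M_{\mathbf{Y}}(\mathbf{t}) &= E[\exp\{\mathbf{t}'\mathbf{Y}\}] \\
&= E[\exp\{\mathbf{t}'(\mathbf{A}\mathbf{X} + \mathbf{b})\}] \\
&= \exp\{\mathbf{t}'\mathbf{b}\}\, E[\exp\{(\mathbf{A}'\mathbf{t})'\mathbf{X}\}] \\
&= \exp\{\mathbf{t}'\mathbf{b}\} \exp\{(\mathbf{A}'\mathbf{t})'\boldsymbol{\mu} + (1/2)(\mathbf{A}'\mathbf{t})'\boldsymbol{\Sigma}(\mathbf{A}'\mathbf{t})\} \\
&= \exp\{\mathbf{t}'(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}) + (1/2)\mathbf{t}'\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'\mathbf{t}\},
\end{aligned}$$

which is the mgf of an $N_m(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}')$ distribution. ■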
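In practice, applying Theorem 3.5.2 amounts to two matrix products. The following base R sketch uses made-up values of $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$, $\mathbf{A}$, and $\mathbf{b}$ (none of them come from the text) to obtain the mean vector and covariance matrix of $\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b}$.

```r
mu    <- c(1, 2, 3)
Sigma <- matrix(c(2, 1, 0,
                  1, 3, 1,
                  0, 1, 4), nrow = 3, byrow = TRUE)

# Y1 = X1 - X2 and Y2 = X2 + X3 + 5
A <- matrix(c(1, -1, 0,
              0,  1, 1), nrow = 2, byrow = TRUE)
b <- c(0, 5)

mean_Y <- drop(A %*% mu + b)      # A mu + b
cov_Y  <- A %*% Sigma %*% t(A)    # A Sigma A'
mean_Y
cov_Y
```

The entries can be confirmed by hand: for instance, Var(X1 − X2) = 2 + 3 − 2(1) = 3, which is the (1,1) entry of `cov_Y`.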


A simple corollary to this theorem gives marginal distributions of a multivariate
normal random variable. Let $\mathbf{X}_1$ be any subvector of $\mathbf{X}$, say of dimension m < n.
Because we can always rearrange means and correlations, there is no loss in
generality in writing $\mathbf{X}$ as

$$\mathbf{X} = \begin{bmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{bmatrix}, \tag{3.5.18}$$
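where $\mathbf{X}_2$ is of dimension p = n − m. In the same way, partition the mean and
covariance matrix of $\mathbf{X}$; that is,

$$\boldsymbol{\mu} = \begin{bmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{bmatrix} \quad \text{and} \quad \boldsymbol{\Sigma} = \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{bmatrix} \tag{3.5.19}$$

with the same dimensions as in expression (3.5.18). Note, for instance, that $\boldsymbol{\Sigma}_{11}$
is the covariance matrix of $\mathbf{X}_1$ and $\boldsymbol{\Sigma}_{12}$ contains all the covariances between the
components of $\mathbf{X}_1$ and $\mathbf{X}_2$. Now define $\mathbf{A}$ to be the matrix

$$\mathbf{A} = [\mathbf{I}_m \,\vdots\, \mathbf{O}_{mp}],$$

where $\mathbf{O}_{mp}$ is an m × p matrix of zeroes. Then $\mathbf{X}_1 = \mathbf{A}\mathbf{X}$. Hence, applying Theorem
3.5.2 to this transformation, along with some matrix algebra, we have the following
corollary:


Corollary 3.5.1. Suppose $\mathbf{X}$ has a $N_n(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ distribution, partitioned as in expressions
(3.5.18) and (3.5.19). Then $\mathbf{X}_1$ has a $N_m(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11})$ distribution.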


This is a useful result because it says that any marginal distribution of X is also 
normal and, further, its mean and covariance matrix are those associated with that 
partial vector. 

Recall in Section 2.5, Theorem 2.5.2, that if two random variables are indepen- 
dent then their covariance is 0. In general, the converse is not true. However, as 
the following theorem shows, it is true for the multivariate normal distribution. 


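Theorem 3.5.3. Suppose $\mathbf{X}$ has a $N_n(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ distribution, partitioned as in the
expressions (3.5.18) and (3.5.19). Then $\mathbf{X}_1$ and $\mathbf{X}_2$ are independent if and only if
$\boldsymbol{\Sigma}_{12} = \mathbf{O}$.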


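Proof: First note that $\boldsymbol{\Sigma}_{21} = \boldsymbol{\Sigma}_{12}'$. The joint mgf of $\mathbf{X}_1$ and $\mathbf{X}_2$ is given by

$$M_{\mathbf{X}_1, \mathbf{X}_2}(\mathbf{t}_1, \mathbf{t}_2) = \exp\left\{\mathbf{t}_1'\boldsymbol{\mu}_1 + \mathbf{t}_2'\boldsymbol{\mu}_2 + \frac{1}{2}\left(\mathbf{t}_1'\boldsymbol{\Sigma}_{11}\mathbf{t}_1 + \mathbf{t}_2'\boldsymbol{\Sigma}_{22}\mathbf{t}_2 + \mathbf{t}_2'\boldsymbol{\Sigma}_{21}\mathbf{t}_1 + \mathbf{t}_1'\boldsymbol{\Sigma}_{12}\mathbf{t}_2\right)\right\}, \tag{3.5.20}$$

where $\mathbf{t}' = (\mathbf{t}_1', \mathbf{t}_2')$ is partitioned the same as $\boldsymbol{\mu}$. By Corollary 3.5.1, $\mathbf{X}_1$ has a
$N_m(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11})$ distribution and $\mathbf{X}_2$ has a $N_p(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_{22})$ distribution. Hence, the product
of their marginal mgfs is

$$M_{\mathbf{X}_1}(\mathbf{t}_1) M_{\mathbf{X}_2}(\mathbf{t}_2) = \exp\left\{\mathbf{t}_1'\boldsymbol{\mu}_1 + \mathbf{t}_2'\boldsymbol{\mu}_2 + \frac{1}{2}\left(\mathbf{t}_1'\boldsymbol{\Sigma}_{11}\mathbf{t}_1 + \mathbf{t}_2'\boldsymbol{\Sigma}_{22}\mathbf{t}_2\right)\right\}. \tag{3.5.21}$$

By (2.6.6) of Section 2.6, $\mathbf{X}_1$ and $\mathbf{X}_2$ are independent if and only if the expressions
(3.5.20) and (3.5.21) are the same. If $\boldsymbol{\Sigma}_{12} = \mathbf{O}'$ and, hence, $\boldsymbol{\Sigma}_{21} = \mathbf{O}$, then the
expressions are the same and $\mathbf{X}_1$ and $\mathbf{X}_2$ are independent. If $\mathbf{X}_1$ and $\mathbf{X}_2$ are
independent, then the covariances between their components are all 0; i.e., $\boldsymbol{\Sigma}_{12} = \mathbf{O}'$
and $\boldsymbol{\Sigma}_{21} = \mathbf{O}$. ■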


Corollary 3.5.1 showed that the marginal distributions of a multivariate normal 
are themselves normal. This is true for conditional distributions, too. As the 


204 Some Special Distributions 


following proof shows, we can combine the results of Theorems 3.5.2 and 3.5.3 to 
obtain the following theorem. 


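Theorem 3.5.4. Suppose $\mathbf{X}$ has a $N_n(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ distribution, which is partitioned as
in expressions (3.5.18) and (3.5.19). Assume that $\boldsymbol{\Sigma}$ is positive definite. Then the
conditional distribution of $\mathbf{X}_1 \mid \mathbf{X}_2$ is

$$N_m\left(\boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{X}_2 - \boldsymbol{\mu}_2),\ \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\right). \tag{3.5.22}$$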


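Proof: Consider first the joint distribution of the random vector
$\mathbf{W} = \mathbf{X}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\mathbf{X}_2$ and $\mathbf{X}_2$. This distribution is obtained from the transformation

$$\begin{bmatrix} \mathbf{W} \\ \mathbf{X}_2 \end{bmatrix} = \begin{bmatrix} \mathbf{I}_m & -\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1} \\ \mathbf{O} & \mathbf{I}_p \end{bmatrix} \begin{bmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{bmatrix}.$$

Because this is a linear transformation, it follows from Theorem 3.5.2 that the joint
distribution is multivariate normal, with $E[\mathbf{W}] = \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\mu}_2$, $E[\mathbf{X}_2] = \boldsymbol{\mu}_2$,
and covariance matrix

$$\begin{bmatrix} \mathbf{I}_m & -\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1} \\ \mathbf{O} & \mathbf{I}_p \end{bmatrix} \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{bmatrix} \begin{bmatrix} \mathbf{I}_m & \mathbf{O}' \\ -\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21} & \mathbf{I}_p \end{bmatrix} = \begin{bmatrix} \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21} & \mathbf{O}' \\ \mathbf{O} & \boldsymbol{\Sigma}_{22} \end{bmatrix}.$$

Hence, by Theorem 3.5.3 the random vectors $\mathbf{W}$ and $\mathbf{X}_2$ are independent. Thus
the conditional distribution of $\mathbf{W} \mid \mathbf{X}_2$ is the same as the marginal distribution of
$\mathbf{W}$; that is,

$$\mathbf{W} \mid \mathbf{X}_2 \text{ is } N_m\left(\boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\mu}_2,\ \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\right).$$

Further, because of this independence, $\mathbf{W} + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\mathbf{X}_2$ given $\mathbf{X}_2$ is distributed as

$$N_m\left(\boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\mu}_2 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\mathbf{X}_2,\ \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\right), \tag{3.5.23}$$

which is the desired result. ■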
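The conditional mean and covariance in (3.5.22) can be computed directly from a partition of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. The base R sketch below is one way to do so; the function name `cond_mvn`, its arguments, and the numerical values are ours, and the function assumes that $\boldsymbol{\Sigma}_{22}$ is invertible.

```r
# Conditional distribution of X[idx1] given X[-idx1] = x2, following (3.5.22).
# cond_mvn is a hypothetical helper name, not a function from the text.
cond_mvn <- function(mu, Sigma, idx1, x2) {
  idx2 <- setdiff(seq_along(mu), idx1)
  S11 <- Sigma[idx1, idx1, drop = FALSE]
  S12 <- Sigma[idx1, idx2, drop = FALSE]
  S21 <- Sigma[idx2, idx1, drop = FALSE]
  S22 <- Sigma[idx2, idx2, drop = FALSE]
  B   <- S12 %*% solve(S22)                      # Sigma_12 Sigma_22^{-1}
  list(mean = drop(mu[idx1] + B %*% (x2 - mu[idx2])),
       cov  = S11 - B %*% S21)
}

# Trivariate illustration: distribution of X1 given (X2, X3) = (1, 0).
mu    <- c(0, 1, 2)
Sigma <- matrix(c(2, 1, 0,
                  1, 3, 1,
                  0, 1, 4), nrow = 3, byrow = TRUE)
res <- cond_mvn(mu, Sigma, idx1 = 1, x2 = c(1, 0))
res$mean   # 2/11 by hand calculation
res$cov    # 18/11 by hand calculation
```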


In the following remark, we return to the bivariate normal using the above 
general notation. 


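Remark 3.5.1 (Continuation of the Bivariate Normal). Suppose (X,Y) has a
$N_2(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ distribution, where

$$\boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} \quad \text{and} \quad \boldsymbol{\Sigma} = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix}. \tag{3.5.24}$$

Substituting $\rho\sigma_1\sigma_2$ for $\sigma_{12}$ in $\boldsymbol{\Sigma}$, it is easy to see that the determinant of $\boldsymbol{\Sigma}$ is
$\sigma_1^2\sigma_2^2(1 - \rho^2)$. Recall that $\rho^2 \le 1$. For the remainder of this remark, assume that
$\rho^2 < 1$. In this case, $\boldsymbol{\Sigma}$ is invertible (it is also positive definite). Further, since $\boldsymbol{\Sigma}$ is
a 2 × 2 matrix, its inverse can easily be determined to be

$$\boldsymbol{\Sigma}^{-1} = \frac{1}{\sigma_1^2\sigma_2^2(1 - \rho^2)} \begin{bmatrix} \sigma_2^2 & -\rho\sigma_1\sigma_2 \\ -\rho\sigma_1\sigma_2 & \sigma_1^2 \end{bmatrix}. \tag{3.5.25}$$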
This shows the equivalence of the bivariate normal pdf notation, (3.5.1), and the
general multivariate normal distribution with n = 2 pdf notation, (3.5.16).

To simplify the conditional normal distribution (3.5.22) for the bivariate case,
consider once more the bivariate normal distribution that was given in Section 3.5.1.
For this case, reversing the roles so that $Y = X_1$ and $X = X_2$, expression (3.5.22)
shows that the conditional distribution of Y given X = x is

$$N\left[\mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x - \mu_1),\ \sigma_2^2(1 - \rho^2)\right]. \tag{3.5.26}$$

Thus, with a bivariate normal distribution, the conditional mean of Y, given that
X = x, is linear in x and is given by

$$E(Y \mid x) = \mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x - \mu_1).$$

Although the mean of the conditional distribution of Y, given X = x, depends
upon x (unless $\rho = 0$), the variance $\sigma_2^2(1 - \rho^2)$ is the same for all real values of x.
Thus, by way of example, given that X = x, the conditional probability that Y is
within $(2.576)\sigma_2\sqrt{1 - \rho^2}$ units of the conditional mean is 0.99, whatever the value
of x may be. In this sense, most of the probability for the distribution of X and Y
lies in the band

$$\mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x - \mu_1) \pm 2.576\,\sigma_2\sqrt{1 - \rho^2}$$

about the graph of the linear conditional mean. For every fixed positive $\sigma_2$, the
width of this band depends upon $\rho$. Because the band is narrow when $\rho^2$ is nearly
1, we see that $\rho$ does measure the intensity of the concentration of the probability
for X and Y about the linear conditional mean. We alluded to this fact in the
remark of Section 2.5.

In a similar manner we can show that the conditional distribution of X, given
Y = y, is the normal distribution

$$N\left[\mu_1 + \rho\frac{\sigma_1}{\sigma_2}(y - \mu_2),\ \sigma_1^2(1 - \rho^2)\right].$$

Example 3.5.1. Let us assume that in a certain population of married couples the
height $X_1$ of the husband and the height $X_2$ of the wife have a bivariate normal
distribution with parameters $\mu_1 = 5.8$ feet, $\mu_2 = 5.3$ feet, $\sigma_1 = \sigma_2 = 0.2$ foot, and
$\rho = 0.6$. The conditional pdf of $X_2$, given $X_1 = 6.3$, is normal, with mean
5.3 + (0.6)(6.3 − 5.8) = 5.6 and standard deviation $(0.2)\sqrt{1 - 0.36} = 0.16$. Accordingly,
given that the height of the husband is 6.3 feet, the probability that his wife has a
height between 5.28 and 5.92 feet is

$$P(5.28 < X_2 < 5.92 \mid X_1 = 6.3) = \Phi(2) - \Phi(-2) = 0.954.$$

The interval (5.28, 5.92) could be thought of as a 95.4% prediction interval for the
wife’s height, given $X_1 = 6.3$. ■
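The arithmetic of Example 3.5.1 can be reproduced in a few lines of R using the conditional mean and standard deviation from (3.5.26). This is only a sketch of the calculation; the variable names are ours.

```r
mu1 <- 5.8; mu2 <- 5.3           # mean heights of husband and wife (feet)
s1  <- 0.2; s2  <- 0.2           # standard deviations
rho <- 0.6
x1  <- 6.3                       # observed height of the husband

cond_mean <- mu2 + rho * (s2 / s1) * (x1 - mu1)   # conditional mean of X2
cond_sd   <- s2 * sqrt(1 - rho^2)                 # conditional sd of X2

# P(5.28 < X2 < 5.92 | X1 = 6.3)
p <- pnorm(5.92, cond_mean, cond_sd) - pnorm(5.28, cond_mean, cond_sd)
c(mean = cond_mean, sd = cond_sd, prob = p)
```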
3.5.3 *Applications


In this section, we consider several applications of the multivariate normal
distribution. The reader may have already encountered these in an applied course in
statistics. The first is principal components, which results in a linear function of a
multivariate normal random vector that has independent components and preserves
the “total” variation in the problem.

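Let the random vector $\mathbf{X}$ have the multivariate normal distribution $N_n(\boldsymbol{\mu}, \boldsymbol{\Sigma})$
where $\boldsymbol{\Sigma}$ is positive definite. As in (3.5.8), write the spectral decomposition of $\boldsymbol{\Sigma}$
as $\boldsymbol{\Sigma} = \boldsymbol{\Gamma}'\boldsymbol{\Lambda}\boldsymbol{\Gamma}$. Recall that the columns, $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n$, of $\boldsymbol{\Gamma}'$ are the eigenvectors
corresponding to the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ that form the main diagonal of the
matrix $\boldsymbol{\Lambda}$. Assume without loss of generality that the eigenvalues are decreasing;
i.e., $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n > 0$. Define the random vector $\mathbf{Y} = \boldsymbol{\Gamma}(\mathbf{X} - \boldsymbol{\mu})$. Since
$\boldsymbol{\Gamma}\boldsymbol{\Sigma}\boldsymbol{\Gamma}' = \boldsymbol{\Lambda}$, by Theorem 3.5.2 $\mathbf{Y}$ has a $N_n(\mathbf{0}, \boldsymbol{\Lambda})$ distribution. Hence the components
$Y_1, Y_2, \ldots, Y_n$ are independent random variables and, for i = 1, 2, ..., n, $Y_i$ has
a $N(0, \lambda_i)$ distribution. The random vector $\mathbf{Y}$ is called the vector of principal
components.

We say the total variation, (TV), of a random vector is the sum of the variances
of its components. For the random vector $\mathbf{X}$, because $\boldsymbol{\Gamma}$ is an orthogonal matrix

$$\text{TV}(\mathbf{X}) = \sum_{i=1}^{n} \sigma_i^2 = \text{tr}\,\boldsymbol{\Sigma} = \text{tr}\,\boldsymbol{\Gamma}'\boldsymbol{\Lambda}\boldsymbol{\Gamma} = \text{tr}\,\boldsymbol{\Lambda}\boldsymbol{\Gamma}\boldsymbol{\Gamma}' = \sum_{i=1}^{n} \lambda_i = \text{TV}(\mathbf{Y}).$$

Hence, $\mathbf{X}$ and $\mathbf{Y}$ have the same total variation.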

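Next, consider the first component of $\mathbf{Y}$, which is given by $Y_1 = \mathbf{v}_1'(\mathbf{X} - \boldsymbol{\mu})$.
This is a linear combination of the components of $\mathbf{X} - \boldsymbol{\mu}$ with the property
$\|\mathbf{v}_1\|^2 = \sum_{j=1}^{n} v_{1j}^2 = 1$, because $\boldsymbol{\Gamma}'$ is orthogonal. Consider any other linear combination of
$(\mathbf{X} - \boldsymbol{\mu})$, say $\mathbf{a}'(\mathbf{X} - \boldsymbol{\mu})$ such that $\|\mathbf{a}\|^2 = 1$. Because $\mathbf{a} \in R^n$ and $\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$ forms
a basis for $R^n$, we must have $\mathbf{a} = \sum_{j=1}^{n} a_j\mathbf{v}_j$ for some set of scalars $a_1, \ldots, a_n$.
Furthermore, because the basis $\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$ is orthonormal

$$\mathbf{a}'\mathbf{v}_i = \left(\sum_{j=1}^{n} a_j\mathbf{v}_j\right)'\mathbf{v}_i = \sum_{j=1}^{n} a_j\mathbf{v}_j'\mathbf{v}_i = a_i.$$

Using (3.5.9) and the fact that $\lambda_i > 0$, we have the inequality

$$\begin{aligned}
\text{Var}(\mathbf{a}'\mathbf{X}) &= \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a} \\
&= \sum_{i=1}^{n} \lambda_i(\mathbf{a}'\mathbf{v}_i)^2 \\
&= \sum_{i=1}^{n} \lambda_i a_i^2 \le \lambda_1\sum_{i=1}^{n} a_i^2 = \lambda_1 = \text{Var}(Y_1).
\end{aligned} \tag{3.5.27}$$

Hence, $Y_1$ has the maximum variance of any linear combination $\mathbf{a}'(\mathbf{X} - \boldsymbol{\mu})$, such
that $\|\mathbf{a}\| = 1$. For this reason, $Y_1$ is called the first principal component of $\mathbf{X}$.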
What about the other components, $Y_2, \ldots, Y_n$? As the following theorem shows,
they share a similar property relative to the order of their associated eigenvalue.
For this reason, they are called the second, third, through the nth principal
components, respectively.


Theorem 3.5.5. Consider the situation described above. For j = 2, ..., n and
i = 1, 2, ..., j − 1, $\text{Var}[\mathbf{a}'\mathbf{X}] \le \lambda_j = \text{Var}(Y_j)$, for all vectors $\mathbf{a}$ such that $\mathbf{a} \perp \mathbf{v}_i$ and
$\|\mathbf{a}\| = 1$.


The proof of this theorem is similar to that for the first principal component 
and is left as Exercise 3.5.20. A second application concerning linear regression is 
offered in Exercise 3.5.22. 
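The principal component calculations above map directly onto the R function `eigen`. The following sketch uses a small covariance matrix of our own choosing to check numerically that $\boldsymbol{\Gamma}\boldsymbol{\Sigma}\boldsymbol{\Gamma}' = \boldsymbol{\Lambda}$, that the total variation is preserved, and that a particular unit vector $\mathbf{a}$ satisfies inequality (3.5.27). A single vector is of course only an illustration of the inequality, not a proof.

```r
Sigma <- matrix(c(4, 2, 1,
                  2, 3, 1,
                  1, 1, 2), nrow = 3, byrow = TRUE)  # illustrative covariance

eig    <- eigen(Sigma, symmetric = TRUE)
lambda <- eig$values            # lambda_1 >= lambda_2 >= lambda_3
Gamma  <- t(eig$vectors)        # rows of Gamma are the eigenvectors v_i'

# Covariance matrix of Y = Gamma (X - mu) is Gamma Sigma Gamma' = Lambda.
cov_Y <- Gamma %*% Sigma %*% t(Gamma)

tv_X <- sum(diag(Sigma))        # total variation of X
tv_Y <- sum(lambda)             # total variation of Y

# Var(a'X) = a' Sigma a for one unit vector a, compared with lambda_1.
a     <- c(1, 1, 1) / sqrt(3)
var_a <- drop(t(a) %*% Sigma %*% a)
c(tv_X = tv_X, tv_Y = tv_Y, var_a = var_a, lambda_1 = lambda[1])
```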


EXERCISES 


3.5.1. Let X and Y have a bivariate normal distribution with respective parameters
$\mu_x = 2.8$, $\mu_y = 110$, $\sigma_x^2 = 0.16$, $\sigma_y^2 = 100$, and $\rho = 0.6$. Using R, compute:


(a) P(106 < Y < 124). 
(b) P(106 < Y < 124|X = 3.2). 


3.5.2. Let X and Y have a bivariate normal distribution with parameters
$\mu_1 = 3$, $\mu_2 = 1$, $\sigma_1^2 = 16$, $\sigma_2^2 = 25$, and $\rho = \frac{3}{5}$. Using R, determine the following
probabilities:
(a) P(3 < Y < 8).
(b) P(3 < Y < 8 | X = 7).
(c) P(−3 < X < 3).
(d) P(−3 < X < 3 | Y = −4).

3.5.3. Show that expression (3.5.4) is true.

3.5.4. Let f(x,y) be the bivariate normal pdf in expression (3.5.1).


(a) Show that f(x, y) has a unique maximum at $(\mu_1, \mu_2)$.


(b) For a given c > 0, show that the points {(x,y) : f(x,y) = c} of equal proba- 
bility form an ellipse. 


3.5.5. Let $\mathbf{X}$ be $N_2(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Recall expression (3.5.17), which gives the probability of
an elliptical contour region for $\mathbf{X}$. The R function⁸ ellipmake plots the elliptical
contour regions. To graph the elliptical 95% contour for a multivariate normal
distribution with $\boldsymbol{\mu} = (5, 2)'$ and $\boldsymbol{\Sigma}$ with variances 1 and covariance 0.75, use the
code

ellipmake(p=.95,b=matrix(c(1,.75,.75,1),nrow=2),mu=c(5,2)).

This R function can be found at the site listed in the Preface.
(a) Run the above code.
(b) Change the code so the probability is 0.50.
(c) Change the code to obtain an overlay plot of the 0.50 and 0.95 regions.
(d) Using a loop, obtain the overlay plot for a vector of probabilities.

⁸Part of this code was obtained from an anonymous author at the site
http://stats.stackexchange.com/questions/9898/


3.5.6. Let U and V be independent random variables, each having a standard
normal distribution. Show that the mgf $E(e^{t(UV)})$ of the random variable UV is
$(1 - t^2)^{-1/2}$, $-1 < t < 1$.
Hint: Compare $E(e^{tUV})$ with the integral of a bivariate normal pdf that has means
equal to zero.


3.5.7. Let X and Y have a bivariate normal distribution with parameters
$\mu_1 = 5$, $\mu_2 = 10$, $\sigma_1^2 = 1$, $\sigma_2^2 = 25$, and $\rho > 0$. If P(4 < Y < 16 | X = 5) = 0.954,
determine $\rho$.


3.5.8. Let X and Y have a bivariate normal distribution with parameters
$\mu_1 = 20$, $\mu_2 = 40$, $\sigma_1^2 = 9$, $\sigma_2^2 = 4$, and $\rho = 0.6$. Find the shortest interval for which 0.90
is the conditional probability that Y is in the interval, given that X = 22.


3.5.9. Say the correlation coefficient between the heights of husbands and wives is 
0.70 and the mean male height is 5 feet 10 inches with standard deviation 2 inches, 
and the mean female height is 5 feet 4 inches with standard deviation 1.5 inches.
Assuming a bivariate normal distribution, what is the best guess of the height of 
a woman whose husband’s height is 6 feet? Find a 95% prediction interval for her 
height. 


3.5.10. Let

$$f(x, y) = \frac{1}{2\pi} \exp\left[-\frac{1}{2}(x^2 + y^2)\right] \left\{1 + xy \exp\left[-\frac{1}{2}(x^2 + y^2 - 2)\right]\right\},$$

where $-\infty < x < \infty$, $-\infty < y < \infty$. If f(x,y) is a joint pdf, it is not a normal
bivariate pdf. Show that f(x,y) actually is a joint pdf and that each marginal pdf
is normal. Thus the fact that each marginal pdf is normal does not imply that the
joint pdf is bivariate normal.


3.5.11. Let X, Y, and Z have the joint pdf

$$\left(\frac{1}{2\pi}\right)^{3/2} \exp\left(-\frac{x^2 + y^2 + z^2}{2}\right) \left[1 + xyz \exp\left(-\frac{x^2 + y^2 + z^2}{2}\right)\right],$$

where $-\infty < x < \infty$, $-\infty < y < \infty$, and $-\infty < z < \infty$. While X, Y, and Z are
obviously dependent, show that X, Y, and Z are pairwise independent and that
each pair has a bivariate normal distribution.

3.5.12. Let X and Y have a bivariate normal distribution with parameters
$\mu_1 = \mu_2 = 0$, $\sigma_1^2 = \sigma_2^2 = 1$, and correlation coefficient $\rho$. Find the distribution of the
random variable Z = aX + bY in which a and b are nonzero constants.


3.5.13. Establish formula (3.5.11) by a direct multiplication. 


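3.5.14. Let $\mathbf{X} = (X_1, X_2, X_3)$ have a multivariate normal distribution with mean
vector $\mathbf{0}$ and variance-covariance matrix

$$\boldsymbol{\Sigma} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 2 & 1 \\ 0 & 1 & 2 \end{bmatrix}.$$

Find $P(X_1 > X_2 + X_3 + 2)$.
Hint: Find the vector $\mathbf{a}$ so that $\mathbf{a}\mathbf{X} = X_1 - X_2 - X_3$ and make use of Theorem
3.5.2.


3.5.15. Suppose $\mathbf{X}$ is distributed $N_n(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Let $\bar{X} = n^{-1}\sum_{i=1}^{n} X_i$.


(a) Write $\bar{X}$ as $\mathbf{a}\mathbf{X}$ for an appropriate vector $\mathbf{a}$ and apply Theorem 3.5.2 to find
the distribution of $\bar{X}$.


(b) Determine the distribution of $\bar{X}$ if all of its component random variables $X_i$
have the same mean $\mu$.


3.5.16. Suppose $\mathbf{X}$ is distributed $N_2(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Determine the distribution of the
random vector $(X_1 + X_2, X_1 - X_2)$. Show that $X_1 + X_2$ and $X_1 - X_2$ are independent
if $\text{Var}(X_1) = \text{Var}(X_2)$.


3.5.17. Suppose $\mathbf{X}$ is distributed $N_3(\mathbf{0}, \boldsymbol{\Sigma})$, where

$$\boldsymbol{\Sigma} = \begin{bmatrix} 3 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 3 \end{bmatrix}.$$

Find $P((X_1 - 2X_2 + X_3)^2 > 15.36)$.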


3.5.18. Let $X_1, X_2, X_3$ be iid random variables each having a standard normal
distribution. Let the random variables $Y_1, Y_2, Y_3$ be defined by

$$X_1 = Y_1\cos Y_2\sin Y_3, \quad X_2 = Y_1\sin Y_2\sin Y_3, \quad X_3 = Y_1\cos Y_3,$$

where $0 \le Y_1 < \infty$, $0 \le Y_2 < 2\pi$, $0 \le Y_3 \le \pi$. Show that $Y_1, Y_2, Y_3$ are mutually
independent.


3.5.19. Show that expression (3.5.9) is true. 
3.5.20. Prove Theorem 3.5.5. 


3.5.21. Suppose X has a multivariate normal distribution with mean 0 and covari- 
ance matrix 


283 215 277 208 
215 213 217 153 
277 217 336 236 
208 153 236 194 
(a) Find the total variation of X.
(b) Find the principal component vector Y. 


(c) Show that the first principal component accounts for 90% of the total varia- 
tion. 


(d) Show that the first principal component $Y_1$ is essentially a rescaled $\bar{X}$.
Determine the variance of $(1/2)\bar{X}$ and compare it to that of $Y_1$.


Note that the R command eigen(amat) obtains the spectral decomposition of the 
matrix amat. 


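3.5.22. Readers may have encountered the multiple regression model in a previous
course in statistics. We can briefly write it as follows. Suppose we have a vector
of n observations $\mathbf{Y}$ which has the distribution $N_n(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$, where $\mathbf{X}$ is an n × p
matrix of known values, which has full column rank p, and $\boldsymbol{\beta}$ is a p × 1 vector of
unknown parameters. The least squares estimator of $\boldsymbol{\beta}$ is

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}.$$

(a) Determine the distribution of $\hat{\boldsymbol{\beta}}$.
(b) Let $\hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$. Determine the distribution of $\hat{\mathbf{Y}}$.
(c) Let $\hat{\mathbf{e}} = \mathbf{Y} - \hat{\mathbf{Y}}$. Determine the distribution of $\hat{\mathbf{e}}$.


(d) By writing the random vector $(\hat{\mathbf{Y}}', \hat{\mathbf{e}}')'$ as a linear function of $\mathbf{Y}$, show that
the random vectors $\hat{\mathbf{Y}}$ and $\hat{\mathbf{e}}$ are independent.


(e) Show that $\hat{\boldsymbol{\beta}}$ solves the least squares problem; that is,

$$\|\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}\|^2 = \min_{\mathbf{b} \in R^p} \|\mathbf{Y} - \mathbf{X}\mathbf{b}\|^2.$$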


3.6 t- and F-Distributions 


It is the purpose of this section to define two additional distributions that are quite 
useful in certain problems of statistical inference. These are called, respectively, the 
(Student’s) t-distribution and the F-distribution. 


3.6.1 The t-distribution 


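Let W denote a random variable that is N(0,1); let V denote a random variable
that is $\chi^2(r)$; and let W and V be independent. Then the joint pdf of W and V,
say h(w,v), is the product of the pdf of W and that of V or

$$h(w, v) = \begin{cases} \dfrac{1}{\sqrt{2\pi}}\, e^{-w^2/2}\, \dfrac{1}{\Gamma(r/2)2^{r/2}}\, v^{r/2-1} e^{-v/2} & -\infty < w < \infty,\ 0 < v < \infty \\ 0 & \text{elsewhere.} \end{cases}$$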
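Define a new random variable T by writing

$$T = \frac{W}{\sqrt{V/r}}. \tag{3.6.1}$$

The transformation technique is used to obtain the pdf $g_1(t)$ of T. The equations

$$t = \frac{w}{\sqrt{v/r}} \quad \text{and} \quad u = v$$

define a transformation that maps $\mathcal{S} = \{(w, v) : -\infty < w < \infty,\ 0 < v < \infty\}$
one-to-one and onto $\mathcal{T} = \{(t, u) : -\infty < t < \infty,\ 0 < u < \infty\}$. Since
$w = t\sqrt{u}/\sqrt{r}$, $v = u$, the absolute value of the Jacobian of the transformation is
$|J| = \sqrt{u}/\sqrt{r}$. Accordingly, the joint pdf of T and U = V is given by

$$g(t, u) = h\left(\frac{t\sqrt{u}}{\sqrt{r}}, u\right)|J| = \begin{cases} \dfrac{1}{\sqrt{2\pi}\,\Gamma(r/2)2^{r/2}}\, u^{r/2-1} \exp\left[-\dfrac{u}{2}\left(1 + \dfrac{t^2}{r}\right)\right] \dfrac{\sqrt{u}}{\sqrt{r}} & |t| < \infty,\ 0 < u < \infty \\ 0 & \text{elsewhere.} \end{cases}$$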

The marginal pdf of T is then

$$g_1(t) = \int_{-\infty}^{\infty} g(t, u)\,du = \int_{0}^{\infty} \frac{1}{\sqrt{2\pi r}\,\Gamma(r/2)2^{r/2}}\, u^{(r+1)/2-1} \exp\left[-\frac{u}{2}\left(1 + \frac{t^2}{r}\right)\right] du.$$

In this integral let $z = u[1 + (t^2/r)]/2$, and it is seen that

$$\begin{aligned}
g_1(t) &= \int_{0}^{\infty} \frac{1}{\sqrt{2\pi r}\,\Gamma(r/2)2^{r/2}} \left(\frac{2z}{1 + t^2/r}\right)^{(r+1)/2-1} e^{-z} \left(\frac{2}{1 + t^2/r}\right) dz \\
&= \frac{\Gamma[(r+1)/2]}{\sqrt{\pi r}\,\Gamma(r/2)}\, \frac{1}{(1 + t^2/r)^{(r+1)/2}}, \quad -\infty < t < \infty.
\end{aligned} \tag{3.6.2}$$

Thus, if W is N(0, 1), V is $\chi^2(r)$, and W and V are independent, then $T = W/\sqrt{V/r}$
has the pdf $g_1(t)$, (3.6.2). The distribution of the random variable T is usually
called a t-distribution. It should be observed that a t-distribution is completely
determined by the parameter r, the number of degrees of freedom of the random
variable that has the chi-square distribution.

The pdf $g_1(t)$ satisfies $g_1(-t) = g_1(t)$; hence, the pdf of T is symmetric about 0.
Thus, the median of T is 0. Upon differentiating $g_1(t)$, it follows that the unique
maximum of the pdf occurs at 0 and that the derivative is continuous. So, the pdf is
mound shaped. As the degrees of freedom approach $\infty$, the t-distribution converges
to the N(0,1) distribution; see Example 5.2.3 of Chapter 5.

The R command pt(t,r) computes the probability $P(T \le t)$ when T has a
t-distribution with r degrees of freedom. For instance, the probability that a
t-distributed random variable with 15 degrees of freedom is less than 2.0 is computed
as pt(2.0,15), while the command qt(.975,15) returns the 97.5th percentile of
this distribution. The R code t=seq(-4,4,.01) followed by plot(dt(t,3)~t)
yields a plot of the t-pdf with 3 degrees of freedom.
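As a check on the derivation, the pdf (3.6.2) can be coded directly and compared with R's built-in `dt`, and the construction $T = W/\sqrt{V/r}$ can be checked by simulation against `pt`. The sketch below is ours; the function name `g1`, the grid, the seed, and the simulation size are arbitrary choices.

```r
# Density (3.6.2), coded directly from the formula; g1 is a hypothetical name.
g1 <- function(t, r) {
  gamma((r + 1) / 2) / (sqrt(pi * r) * gamma(r / 2)) /
    (1 + t^2 / r)^((r + 1) / 2)
}

t_grid   <- seq(-4, 4, by = 0.5)
max_diff <- max(abs(g1(t_grid, 3) - dt(t_grid, 3)))   # should be near 0

# Monte Carlo check of the construction T = W / sqrt(V / r).
set.seed(2)                          # arbitrary seed
r <- 15
W <- rnorm(100000)
V <- rchisq(100000, df = r)
sim_T <- W / sqrt(V / r)
c(max_diff = max_diff, simulated = mean(sim_T <= 2), exact = pt(2.0, r))
```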

Before the age of modern computing, tables of the distribution of T were used.
Because the pdf of T does depend on its degrees of freedom r, the usual t-table
gives selected quantiles versus degrees of freedom. Table III in Appendix D is such
a table. The following three lines of R code, however, produce this table.

ps = c(.9,.925,.950,.975,.99,.995,.999); df = 1:30; tab=c()
for(r in df){tab=rbind(tab,qt(ps,r))}; df=c(df,Inf); nq=qnorm(ps)
tab=rbind(tab,nq); tab=cbind(df,tab)

This code is the body of the R function ttable found at the site listed in the
Preface. Because the t-distribution converges to the N(0,1) distribution,
only the degrees of freedom from 1 to 30 are used in such tables. This is also the
reason that the last line in the table gives the standard normal quantiles.


Remark 3.6.1. The t-distribution was first discovered by W. S. Gosset when he
was working for an Irish brewery. Gosset published under the pseudonym Student.
Thus this distribution is often known as Student’s t-distribution. ■


Example 3.6.1 (Mean and Variance of the ¢-Distribution). Let the random variable 
T have a t-distribution with r degrees of freedom. Then, as in (3.6.1), we can write 
T =W/(V/r)~‘/?, where W has a N(0,1) distribution, V has a y?(r) distribution, 
and W and V are independent random variables. Independence of W and V and 
expression (3.3.8), provided (r/2) — (k/2) > 0 (ie., k <r), implies the following: 


—k/2 —k/2 
E(T*) = E|Ww* (=) = E(W*)E (=) | (3.6.3) 
= E aa ifk <r. (3.6.4) 


Because E(W) = 0, the mean of T is 0, as long as the degrees of freedom of T exceed
1. For the variance, use k = 2 in expression (3.6.4). In this case the condition r > k
becomes r > 2. Since $E(W^2) = 1$, by expression (3.6.4), the variance of T is given
by

$$\text{Var}(T) = E(T^2) = \frac{r}{r-2}. \tag{3.6.5}$$

Therefore, a t-distribution with r > 2 degrees of freedom has a mean of 0 and a
variance of r/(r − 2). ■
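
The variance formula r/(r − 2) can be checked numerically. The following is a minimal sketch, not part of the text's own code: the helper name t_second_moment and the chosen degrees of freedom are illustrative assumptions, and the integral of t² against the t-pdf is approximated with R's integrate.

```r
# Numerical check of Var(T) = E(T^2) = r/(r-2) for several r > 2.
# t_second_moment is a hypothetical helper name used only for this sketch.
t_second_moment <- function(r) {
  integrate(function(t) t^2 * dt(t, df = r), lower = -Inf, upper = Inf)$value
}

r_values    <- c(5, 10, 30)
numeric_var <- sapply(r_values, t_second_moment)
closed_form <- r_values / (r_values - 2)

# Side-by-side comparison of the numerical and closed-form variances.
print(cbind(r = r_values, numeric_var, closed_form))
```

As r grows, both columns approach 1, the variance of the limiting N(0,1) distribution.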


3.6.2 The F-distribution


Next consider two independent chi-square random variables U and V having $r_1$ and
$r_2$ degrees of freedom, respectively. The joint pdf h(u,v) of U and V is then

$$h(u,v) = \begin{cases} \dfrac{1}{\Gamma(r_1/2)\Gamma(r_2/2)2^{(r_1+r_2)/2}}\, u^{r_1/2-1} v^{r_2/2-1} e^{-(u+v)/2} & 0 < u, v < \infty \\ 0 & \text{elsewhere.} \end{cases}$$

We define the new random variable

$$W = \frac{U/r_1}{V/r_2}$$

and we propose finding the pdf $g_1(w)$ of W. The equations

$$w = \frac{u/r_1}{v/r_2}, \quad z = v,$$

define a one-to-one transformation that maps the set $\mathcal{S} = \{(u,v): 0 < u < \infty,\ 0 < v < \infty\}$
onto the set $\mathcal{T} = \{(w,z): 0 < w < \infty,\ 0 < z < \infty\}$. Since $u = (r_1/r_2)zw$, $v = z$,
the absolute value of the Jacobian of the transformation is
$|J| = (r_1/r_2)z$. The joint pdf g(w, z) of the random variables W and Z = V is then

$$g(w,z) = \frac{1}{\Gamma(r_1/2)\Gamma(r_2/2)2^{(r_1+r_2)/2}} \left(\frac{r_1 z w}{r_2}\right)^{\frac{r_1-2}{2}} z^{\frac{r_2-2}{2}} \exp\left[-\frac{z}{2}\left(\frac{r_1 w}{r_2}+1\right)\right] \frac{r_1 z}{r_2},$$

provided that $(w,z) \in \mathcal{T}$, and zero elsewhere. The marginal pdf $g_1(w)$ of W is then


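$$g_1(w) = \int_{-\infty}^{\infty} g(w,z)\,dz = \int_0^{\infty} \frac{(r_1/r_2)^{r_1/2}\, w^{r_1/2-1}}{\Gamma(r_1/2)\Gamma(r_2/2)2^{(r_1+r_2)/2}}\, z^{(r_1+r_2)/2-1} \exp\left[-\frac{z}{2}\left(\frac{r_1 w}{r_2}+1\right)\right] dz.$$

If we change the variable of integration by writing

$$y = \frac{z}{2}\left(\frac{r_1 w}{r_2}+1\right),$$

it can be seen that

$$g_1(w) = \int_0^{\infty} \frac{(r_1/r_2)^{r_1/2}\, w^{r_1/2-1}}{\Gamma(r_1/2)\Gamma(r_2/2)2^{(r_1+r_2)/2}} \left(\frac{2y}{r_1 w/r_2+1}\right)^{(r_1+r_2)/2-1} e^{-y} \left(\frac{2}{r_1 w/r_2+1}\right) dy$$

$$= \begin{cases} \dfrac{\Gamma[(r_1+r_2)/2]\,(r_1/r_2)^{r_1/2}}{\Gamma(r_1/2)\Gamma(r_2/2)}\, \dfrac{w^{r_1/2-1}}{(1+r_1 w/r_2)^{(r_1+r_2)/2}} & 0 < w < \infty \\ 0 & \text{elsewhere.} \end{cases} \tag{3.6.6}$$
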


Accordingly, if U and V are independent chi-square variables with $r_1$ and $r_2$
degrees of freedom, respectively, then $W = (U/r_1)/(V/r_2)$ has the pdf $g_1(w)$, (3.6.6).
The distribution of this random variable is usually called an F-distribution; and
we often call the ratio, which we have denoted by W, F. That is,

$$F = \frac{U/r_1}{V/r_2}. \tag{3.6.7}$$

It should be observed that an F-distribution is completely determined by the two
parameters $r_1$ and $r_2$.
In terms of R computation, the command pf(2.50,3,8) returns the value
0.8665, which is the probability P(F ≤ 2.50) when F has the F-distribution with 3
and 8 degrees of freedom. The 95th percentile of F is qf(.95,3,8) = 4.066 and
the code x=seq(.01,5,.01); plot(df(x,3,8)~x) draws a plot of the pdf of this
F random variable. Note that the pdf is right-skewed. Before the age of modern
computation, tables of the quantiles of F-distributions for selected probabilities
and degrees of freedom were used. Table IV in Appendix D displays the 95th and
99th quantiles for selected degrees of freedom. Besides its use in statistics, the
F-distribution is used to model lifetime data; see Exercise 3.6.13.
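
As a check on the derivation, the pdf (3.6.6) can be coded directly and compared with R's built-in density df. This is a sketch under stated assumptions: the function name f_pdf and the grid of evaluation points are illustrative choices, not something defined in the text.

```r
# pdf (3.6.6) of F = (U/r1)/(V/r2), coded directly from the formula.
# f_pdf is a hypothetical name used only for this sketch.
f_pdf <- function(w, r1, r2) {
  const <- gamma((r1 + r2) / 2) * (r1 / r2)^(r1 / 2) /
    (gamma(r1 / 2) * gamma(r2 / 2))
  ifelse(w > 0,
         const * w^(r1 / 2 - 1) / (1 + r1 * w / r2)^((r1 + r2) / 2),
         0)
}

# Compare with the built-in density for 3 and 8 degrees of freedom.
x <- seq(0.01, 5, 0.01)
max_abs_diff <- max(abs(f_pdf(x, 3, 8) - df(x, 3, 8)))
print(max_abs_diff)
```

The largest discrepancy should be at the level of floating-point rounding, which supports the closed form obtained from the change of variables.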


Example 3.6.2 (Moments of F-Distributions). Let F have an F-distribution with
$r_1$ and $r_2$ degrees of freedom. Then, as in expression (3.6.7), we can write
$F = (r_2/r_1)(U/V)$, where U and V are independent $\chi^2$ random variables with $r_1$ and $r_2$
degrees of freedom, respectively. Hence, for the kth moment of F, by independence
we have

$$E(F^k) = \left(\frac{r_2}{r_1}\right)^k E(U^k)\,E(V^{-k}),$$

provided, of course, that both expectations on the right side exist. By Theorem
3.3.2, because $k > -(r_1/2)$ is always true, the first expectation always exists. The
second expectation, however, exists if $r_2 > 2k$; i.e., the denominator degrees of
freedom must exceed twice k. Assuming this is true, it follows from (3.3.8) that the
mean of F is given by

$$E(F) = \frac{r_2}{r_1}\, r_1\, \frac{2^{-1}\,\Gamma\left(\frac{r_2}{2}-1\right)}{\Gamma\left(\frac{r_2}{2}\right)} = \frac{r_2}{r_2-2}. \tag{3.6.8}$$

If $r_2$ is large, then E(F) is about 1. In Exercise 3.6.7, a general expression for
$E(F^k)$ is derived. ■

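
The mean $r_2/(r_2-2)$ of an F-distribution can be verified numerically. The sketch below assumes illustrative degrees of freedom (numerator 3, denominators 8, 20, and 100) and a hypothetical helper f_mean; it integrates x against R's density df.

```r
# Numerical mean of an F(r1, r2) random variable via integration of x * pdf.
# f_mean is a hypothetical helper name used only for this sketch.
f_mean <- function(r1, r2) {
  integrate(function(x) x * df(x, r1, r2), lower = 0, upper = Inf)$value
}

r2_values    <- c(8, 20, 100)
numeric_mean <- sapply(r2_values, function(r2) f_mean(3, r2))
closed_form  <- r2_values / (r2_values - 2)

# The mean does not depend on r1 and approaches 1 as r2 grows.
print(cbind(r2 = r2_values, numeric_mean, closed_form))
```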
3.6.3 Student’s Theorem 


Our final note in this section concerns an important result for the later chapters on 
inference for normal random variables. It is a corollary to the t-distribution derived 
above and is often referred to as Student’s Theorem. 


Theorem 3.6.1. Let $X_1, \ldots, X_n$ be iid random variables each having a normal
distribution with mean $\mu$ and variance $\sigma^2$. Define the random variables

$$\overline{X} = \frac{1}{n}\sum_{i=1}^n X_i \quad \text{and} \quad S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \overline{X})^2.$$

Then

(a) $\overline{X}$ has a $N\left(\mu, \frac{\sigma^2}{n}\right)$ distribution.

(b) $\overline{X}$ and $S^2$ are independent.

(c) $(n-1)S^2/\sigma^2$ has a $\chi^2(n-1)$ distribution.

(d) The random variable

$$T = \frac{\overline{X} - \mu}{S/\sqrt{n}} \tag{3.6.9}$$

has a Student t-distribution with n − 1 degrees of freedom.


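Proof: Note that we have proved part (a) in Corollary 3.4.1. Let $\mathbf{X} = (X_1, \ldots, X_n)'$.
Because $X_1, \ldots, X_n$ are iid $N(\mu, \sigma^2)$ random variables, $\mathbf{X}$ has a multivariate normal
distribution $N(\mu\mathbf{1}, \sigma^2\mathbf{I})$, where $\mathbf{1}$ denotes a vector whose components are all 1. Let
$\mathbf{v}' = (1/n, \ldots, 1/n) = (1/n)\mathbf{1}'$. Note that $\overline{X} = \mathbf{v}'\mathbf{X}$. Define the random vector $\mathbf{Y}$
by $\mathbf{Y} = (X_1 - \overline{X}, \ldots, X_n - \overline{X})'$. Consider the following transformation:

$$\mathbf{W} = \begin{bmatrix} \overline{X} \\ \mathbf{Y} \end{bmatrix} = \begin{bmatrix} \mathbf{v}' \\ \mathbf{I} - \mathbf{1}\mathbf{v}' \end{bmatrix} \mathbf{X}. \tag{3.6.10}$$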


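Because $\mathbf{W}$ is a linear transformation of a multivariate normal random vector, by
Theorem 3.5.2 it has a multivariate normal distribution with mean

$$E[\mathbf{W}] = \begin{bmatrix} \mathbf{v}' \\ \mathbf{I} - \mathbf{1}\mathbf{v}' \end{bmatrix} \mu\mathbf{1} = \begin{bmatrix} \mu \\ \mathbf{0}_n \end{bmatrix}, \tag{3.6.11}$$

where $\mathbf{0}_n$ denotes a vector whose components are all 0, and covariance matrix

$$\boldsymbol{\Sigma} = \begin{bmatrix} \mathbf{v}' \\ \mathbf{I} - \mathbf{1}\mathbf{v}' \end{bmatrix} \sigma^2\mathbf{I} \begin{bmatrix} \mathbf{v}' \\ \mathbf{I} - \mathbf{1}\mathbf{v}' \end{bmatrix}' = \sigma^2 \begin{bmatrix} \frac{1}{n} & \mathbf{0}_n' \\ \mathbf{0}_n & \mathbf{I} - \mathbf{1}\mathbf{v}' \end{bmatrix}. \tag{3.6.12}$$
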

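Because $\overline{X}$ is the first component of $\mathbf{W}$, we can also obtain part (a) by
Theorem 3.5.1. Next, because the covariances are 0, $\overline{X}$ is independent of $\mathbf{Y}$. But
$S^2 = (n-1)^{-1}\mathbf{Y}'\mathbf{Y}$. Hence, $\overline{X}$ is independent of $S^2$, also. Thus part (b) is true.
Consider the random variable

$$V = \sum_{i=1}^n \left(\frac{X_i - \mu}{\sigma}\right)^2.$$

Each term in this sum is the square of a N(0,1) random variable and, hence, has
a $\chi^2(1)$ distribution (Theorem 3.4.1). Because the summands are independent, it
follows from Corollary 3.3.1 that V is a $\chi^2(n)$ random variable. Note the following
identity:

$$V = \sum_{i=1}^n \left(\frac{(X_i - \overline{X}) + (\overline{X} - \mu)}{\sigma}\right)^2 = \sum_{i=1}^n \left(\frac{X_i - \overline{X}}{\sigma}\right)^2 + \left(\frac{\overline{X} - \mu}{\sigma/\sqrt{n}}\right)^2 = \frac{(n-1)S^2}{\sigma^2} + \left(\frac{\overline{X} - \mu}{\sigma/\sqrt{n}}\right)^2. \tag{3.6.13}$$
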
By part (b), the two terms on the right side of the last equation are independent.
Further, the second term is the square of a standard normal random variable and,
hence, has a $\chi^2(1)$ distribution. Taking mgfs of both sides, we have

$$(1-2t)^{-n/2} = E\left[\exp\{t(n-1)S^2/\sigma^2\}\right](1-2t)^{-1/2}. \tag{3.6.14}$$

Solving for the mgf of $(n-1)S^2/\sigma^2$ on the right side we obtain part (c). Finally,
part (d) follows immediately from parts (a)–(c) upon writing T, (3.6.9), as

$$T = \frac{(\overline{X} - \mu)/(\sigma/\sqrt{n})}{\sqrt{(n-1)S^2/(\sigma^2(n-1))}}. \quad ■$$

EXERCISES 


3.6.1. Let T have a t-distribution with 10 degrees of freedom. Find P(|T| > 2.228)
from either Table III or by using R.


3.6.2. Let T have a t-distribution with 14 degrees of freedom. Determine b so that
P(−b < T < b) = 0.90. Use either Table III or R.


3.6.3. Let T have a t-distribution with r > 4 degrees of freedom. Use expression 
(3.6.4) to determine the kurtosis of T. See Exercise 1.9.15 for the definition of 
kurtosis. 


3.6.4. Using R, plot the pdfs of the random variables defined in parts (a)—(e) below. 
Obtain an overlay plot of all five pdfs, also. 


(a) X has a standard normal distribution. Use this code: 
x=seq(-6,6,.01); plot(dnorm(x)~x). 


(b) X has a t-distribution with 1 degree of freedom. Use the code:
lines(dt(x,1)~x,lty=2).


(c) X has a t-distribution with 3 degrees of freedom. 
(d) X has a t-distribution with 10 degrees of freedom. 
(e) X has a t-distribution with 30 degrees of freedom. 


3.6.5. Using R, investigate the probabilities of an “outlier” for a t-random variable 
and a normal random variable. Specifically, determine the probability of observing 
the event {|X| > 2} for the following random variables: 


(a) X has a standard normal distribution. 
(b) X has a t-distribution with 1 degree of freedom. 
(c) X has a t-distribution with 3 degrees of freedom.
(d) X has a t-distribution with 10 degrees of freedom.
(e) X has a t-distribution with 30 degrees of freedom.


3.6.6. In expression (3.4.13), the normal location model was presented. Often real
data, though, have more outliers than the normal distribution allows. Based on
Exercise 3.6.5, outliers are more probable for t-distributions with small degrees of
freedom. Consider a location model of the form

$$X = \mu + e,$$

where e has a t-distribution with 3 degrees of freedom. Determine the standard
deviation $\sigma$ of X and then find $P(|X - \mu| \geq \sigma)$.


3.6.7. Let F have an F-distribution with parameters $r_1$ and $r_2$. Assuming that
$r_2 > 2k$, continue with Example 3.6.2 and derive $E(F^k)$.


3.6.8. Let F have an F-distribution with parameters $r_1$ and $r_2$. Using the results
of the last exercise, determine the kurtosis of F, assuming that $r_2 > 8$.


3.6.9. Let F have an F-distribution with parameters $r_1$ and $r_2$. Argue that 1/F
has an F-distribution with parameters $r_2$ and $r_1$.


3.6.10. Suppose F has an F-distribution with parameters $r_1 = 5$ and $r_2 = 10$.
Using only 95th percentiles of F-distributions, find a and b so that P(F < a) = 0.05
and P(F < b) = 0.95, and, accordingly, P(a < F < b) = 0.90.

Hint: Write P(F < a) = P(1/F > 1/a) = 1 − P(1/F < 1/a), and use the result
of Exercise 3.6.9 and R.


3.6.11. Let $T = W/\sqrt{V/r}$, where the independent variables W and V are, respectively,
normal with mean zero and variance 1 and chi-square with r degrees of
freedom. Show that $T^2$ has an F-distribution with parameters $r_1 = 1$ and $r_2 = r$.
Hint: What is the distribution of the numerator of $T^2$?


3.6.12. Show that the t-distribution with r = 1 degree of freedom and the Cauchy
distribution are the same.


3.6.13. Let F have an F-distribution with 2r and 2s degrees of freedom. Since
the support of F is (0, ∞), the F-distribution is often used to model time until
failure (lifetime). In this case, Y = log F is used to model the log of lifetime. The
log F family is a rich family of distributions consisting of left- and right-skewed
distributions as well as symmetric distributions; see, for example, Chapter 4 of
Hettmansperger and McKean (2011). In this exercise, consider the subfamily where
Y = log F and F has 2 and 2s degrees of freedom.


(a) Obtain the pdf and cdf of Y. 


(b) Using R, obtain a page of plots of these distributions for s = .4, .6, 1.0, 2.0, 4.0, 8. 
Comment on the shape of each pdf. 


(c) For s = 1, this distribution is called the logistic distribution. Show that the 
pdf is symmetric about 0. 
3.6.14. Show that

$$Y = \frac{1}{1 + (r_1/r_2)W},$$

where W has an F-distribution with parameters $r_1$ and $r_2$, has a beta distribution.


3.6.15. Let $X_1, X_2$ be iid with common distribution having the pdf $f(x) = e^{-x}$,
$0 < x < \infty$, zero elsewhere. Show that $Z = X_1/X_2$ has an F-distribution.


3.6.16. Let $X_1$, $X_2$, and $X_3$ be three independent chi-square variables with $r_1$, $r_2$,
and $r_3$ degrees of freedom, respectively.


(a) Show that $Y_1 = X_1/X_2$ and $Y_2 = X_1 + X_2$ are independent and that $Y_2$ is
$\chi^2(r_1 + r_2)$.


(b) Deduce that

$$\frac{X_1/r_1}{X_2/r_2} \quad \text{and} \quad \frac{X_3/r_3}{(X_1 + X_2)/(r_1 + r_2)}$$

are independent F-variables.


3.7 *Mixture Distributions 


Recall the discussion on the contaminated normal distribution given in Section 
3.4.1. This was an example of a mixture of normal distributions. In this section, we 
extend this to mixtures of distributions in general. Generally, we use continuous- 
type notation for the discussion, but discrete pmfs can be handled the same way. 

Suppose that we have k distributions with respective pdfs $f_1(x), f_2(x), \ldots, f_k(x)$,
with supports $S_1, S_2, \ldots, S_k$, means $\mu_1, \mu_2, \ldots, \mu_k$, and variances $\sigma_1^2, \sigma_2^2, \ldots, \sigma_k^2$,
with positive mixing probabilities $p_1, p_2, \ldots, p_k$, where $p_1 + p_2 + \cdots + p_k = 1$. Let
$S = \cup_{i=1}^k S_i$ and consider the function

$$f(x) = p_1 f_1(x) + p_2 f_2(x) + \cdots + p_k f_k(x) = \sum_{i=1}^k p_i f_i(x), \quad x \in S. \tag{3.7.1}$$


Note that f(x) is nonnegative and it is easy to see that it integrates to one over
(−∞, ∞); hence, f(x) is a pdf for some continuous-type random variable X. Integrating
term-by-term, it follows that the cdf of X is:

$$F(x) = \sum_{i=1}^k p_i F_i(x), \quad x \in S, \tag{3.7.2}$$

where $F_i(x)$ is the cdf corresponding to the pdf $f_i(x)$. The mean of X is given by

$$E(X) = \sum_{i=1}^k p_i \int_{-\infty}^{\infty} x f_i(x)\,dx = \sum_{i=1}^k p_i \mu_i = \overline{\mu}, \tag{3.7.3}$$

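a weighted average of $\mu_1, \mu_2, \ldots, \mu_k$, and the variance equals

$$\text{var}(X) = \sum_{i=1}^k p_i \int_{-\infty}^{\infty} (x - \overline{\mu})^2 f_i(x)\,dx$$

$$= \sum_{i=1}^k p_i \int_{-\infty}^{\infty} [(x - \mu_i) + (\mu_i - \overline{\mu})]^2 f_i(x)\,dx$$

$$= \sum_{i=1}^k p_i \int_{-\infty}^{\infty} (x - \mu_i)^2 f_i(x)\,dx + \sum_{i=1}^k p_i (\mu_i - \overline{\mu})^2 \int_{-\infty}^{\infty} f_i(x)\,dx,$$

because the cross-product terms integrate to zero. That is,

$$\text{var}(X) = \sum_{i=1}^k p_i \sigma_i^2 + \sum_{i=1}^k p_i (\mu_i - \overline{\mu})^2. \tag{3.7.4}$$
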


Note that the variance is not simply the weighted average of the k variances, but it 
also includes a positive term involving the weighted variance of the means. 
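
The mean and variance formulas (3.7.3) and (3.7.4) can be checked on a concrete mixture. The sketch below assumes a two-component normal mixture with illustrative values (mixing probabilities 0.3 and 0.7, means 0 and 4, standard deviations 1 and 2); none of these numbers come from the text.

```r
# Mixture of two normal distributions; all parameter values are assumed
# for illustration only.
p   <- c(0.3, 0.7)    # mixing probabilities
mu  <- c(0, 4)        # component means
sig <- c(1, 2)        # component standard deviations

mix_pdf <- function(x) {
  p[1] * dnorm(x, mu[1], sig[1]) + p[2] * dnorm(x, mu[2], sig[2])
}

# Formulas (3.7.3) and (3.7.4).
mu_bar      <- sum(p * mu)
var_formula <- sum(p * sig^2) + sum(p * (mu - mu_bar)^2)

# Direct numerical integration against the mixture pdf.
mean_numeric <- integrate(function(x) x * mix_pdf(x), -Inf, Inf)$value
var_numeric  <- integrate(function(x) (x - mu_bar)^2 * mix_pdf(x),
                          -Inf, Inf)$value

print(c(mu_bar, mean_numeric, var_formula, var_numeric))
```

Here the weighted average of the component variances is 3.1, while the spread of the component means contributes a further 3.36 to the mixture variance.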


Remark 3.7.1. It is extremely important to note these characteristics are associated
with a mixture of k distributions and have nothing to do with a linear
combination, say $\sum a_i X_i$, of k random variables. ■


For the next example, we need the following distribution. We say that X has a
loggamma pdf with parameters $\alpha > 0$ and $\beta > 0$ if it has pdf

$$f_1(x) = \begin{cases} \dfrac{1}{\Gamma(\alpha)\beta^{\alpha}}\, x^{-(1+\beta)/\beta} (\log x)^{\alpha-1} & x > 1 \\ 0 & \text{elsewhere.} \end{cases} \tag{3.7.5}$$

The derivation of this pdf is given in Exercise 3.7.1, where its mean and variance
are also derived. We denote this distribution of X by $\log\Gamma(\alpha, \beta)$.


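Example 3.7.1. Actuaries have found that a mixture of the loggamma and gamma
distributions is an important model for claim distributions. Suppose, then, that $X_1$
is $\log\Gamma(\alpha_1, \beta_1)$, $X_2$ is $\Gamma(\alpha_2, \beta_2)$, and the mixing probabilities are p and (1 − p).
Then the pdf of the mixture distribution is

$$f(x) = \begin{cases} \dfrac{1-p}{\beta_2^{\alpha_2}\Gamma(\alpha_2)}\, x^{\alpha_2-1} e^{-x/\beta_2} & 0 < x \leq 1 \\[2ex] \dfrac{p}{\beta_1^{\alpha_1}\Gamma(\alpha_1)}\, (\log x)^{\alpha_1-1} x^{-(\beta_1+1)/\beta_1} + \dfrac{1-p}{\beta_2^{\alpha_2}\Gamma(\alpha_2)}\, x^{\alpha_2-1} e^{-x/\beta_2} & 1 < x \\[2ex] 0 & \text{elsewhere.} \end{cases} \tag{3.7.6}$$

Provided $\beta_1 < 2^{-1}$, the mean and the variance of this mixture distribution are

$$\mu = p(1-\beta_1)^{-\alpha_1} + (1-p)\alpha_2\beta_2 \tag{3.7.7}$$

$$\sigma^2 = p\left[(1-2\beta_1)^{-\alpha_1} - (1-\beta_1)^{-2\alpha_1}\right] + (1-p)\alpha_2\beta_2^2 + p(1-p)\left[(1-\beta_1)^{-\alpha_1} - \alpha_2\beta_2\right]^2; \tag{3.7.8}$$

see Exercise 3.7.3. ■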
The mixture of distributions is sometimes called compounding. Moreover, it
does not need to be restricted to a finite number of distributions. As demonstrated
in the following example, a continuous weighting function, which is of course a pdf,
can replace $p_1, p_2, \ldots, p_k$; i.e., integration replaces summation.


Example 3.7.2. Let $X_\theta$ be a Poisson random variable with parameter $\theta$. We want
to mix an infinite number of Poisson distributions, each with a different value of $\theta$.
We let the weighting function be a pdf of $\theta$, namely, a gamma with parameters $\alpha$
and $\beta$. For x = 0, 1, 2, ..., the pmf of the compound distribution is

$$p(x) = \int_0^{\infty} \left[\frac{1}{\beta^{\alpha}\Gamma(\alpha)}\,\theta^{\alpha-1}e^{-\theta/\beta}\right] \left[\frac{\theta^x e^{-\theta}}{x!}\right] d\theta$$

$$= \frac{1}{\Gamma(\alpha)\,x!\,\beta^{\alpha}} \int_0^{\infty} \theta^{\alpha+x-1} e^{-\theta(1+\beta)/\beta}\,d\theta$$

$$= \frac{\Gamma(\alpha+x)\,\beta^x}{\Gamma(\alpha)\,x!\,(1+\beta)^{\alpha+x}},$$

where the third line follows from the change of variable $t = \theta(1+\beta)/\beta$ to solve the
integral of the second line.

An interesting case of this compound occurs when $\alpha = r$, a positive integer, and
$\beta = (1-p)/p$, where 0 < p < 1. In this case the pmf becomes

$$p(x) = \frac{(r+x-1)!}{(r-1)!\,x!}\, p^r (1-p)^x, \quad x = 0, 1, 2, \ldots.$$

That is, this compound distribution is the same as that of the number of excess
trials needed to obtain r successes in a sequence of independent trials, each with
probability p of success; this is one form of the negative binomial distribution.
The negative binomial distribution has been used successfully as a model for the
number of accidents (see Weber, 1971). ■
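
The identification of the gamma mixture of Poissons with the negative binomial can be checked numerically. The sketch below assumes illustrative values r = 3 and p = 0.4 and a hypothetical function name compound_pmf; it compares the closed-form compound pmf with R's dnbinom, and also evaluates the mixing integral directly for one value of x.

```r
# Closed-form pmf of the Poisson-gamma compound, computed on the log scale.
# compound_pmf is a hypothetical name used only for this sketch.
compound_pmf <- function(x, alpha, beta) {
  exp(lgamma(alpha + x) + x * log(beta) - lgamma(alpha) -
        lfactorial(x) - (alpha + x) * log(1 + beta))
}

r <- 3; p <- 0.4             # assumed values for illustration
beta <- (1 - p) / p
x <- 0:25

# Agreement with the negative binomial pmf (failures before r successes).
max_diff_nb <- max(abs(compound_pmf(x, r, beta) -
                         dnbinom(x, size = r, prob = p)))

# Direct numerical evaluation of the mixing integral at x = 2.
mix_integral <- integrate(function(th) {
  dgamma(th, shape = r, scale = beta) * dpois(2, th)
}, lower = 0, upper = Inf)$value

print(c(max_diff_nb, mix_integral, compound_pmf(2, r, beta)))
```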


In compounding, we can think of the original distribution of X as being a conditional
distribution given $\theta$, whose pdf is denoted by $f(x|\theta)$. Then the weighting
function is treated as a pdf for $\theta$, say $g(\theta)$. Accordingly, the joint pdf is $f(x|\theta)g(\theta)$,
and the compound pdf can be thought of as the marginal (unconditional) pdf of X,

$$h(x) = \int_{\theta} g(\theta) f(x|\theta)\,d\theta,$$

where a summation replaces integration in case $\theta$ has a discrete distribution. For
illustration, suppose we know that the mean of the normal distribution is zero but
the variance $\sigma^2$ equals $1/\theta > 0$, where $\theta$ has been selected from some random model.
For convenience, say this latter is a gamma distribution with parameters $\alpha$ and $\beta$.
Thus, given $\theta$, X is conditionally $N(0, 1/\theta)$ so that the joint distribution of X
and $\theta$ is

$$f(x|\theta)g(\theta) = \left[\frac{\sqrt{\theta}}{\sqrt{2\pi}} \exp\left(\frac{-\theta x^2}{2}\right)\right] \left[\frac{1}{\beta^{\alpha}\Gamma(\alpha)}\,\theta^{\alpha-1}\exp(-\theta/\beta)\right],$$

for $-\infty < x < \infty$, $0 < \theta < \infty$. Therefore, the marginal (unconditional) pdf h(x)
of X is found by integrating out $\theta$; that is,

$$h(x) = \int_0^{\infty} \frac{\theta^{\alpha+1/2-1}}{\beta^{\alpha}\sqrt{2\pi}\,\Gamma(\alpha)} \exp\left[-\theta\left(\frac{x^2}{2} + \frac{1}{\beta}\right)\right] d\theta.$$

By comparing this integrand with a gamma pdf with parameters $\alpha + \frac{1}{2}$ and
$[(1/\beta) + (x^2/2)]^{-1}$, we see that the integral equals

$$h(x) = \frac{\Gamma\left(\alpha + \frac{1}{2}\right)}{\beta^{\alpha}\sqrt{2\pi}\,\Gamma(\alpha)} \left(\frac{2\beta}{2 + \beta x^2}\right)^{\alpha+1/2}, \quad -\infty < x < \infty.$$

It is interesting to note that if $\alpha = r/2$ and $\beta = 2/r$, where r is a positive integer,
then X has an unconditional distribution, which is Student’s t, with r degrees of
freedom. That is, we have developed a generalization of Student’s distribution
through this type of mixing or compounding. We note that the resulting distribution
(a generalization of Student’s t) has much thicker tails than those of the conditional
normal with which we started.
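
The claim that the compound pdf reduces to Student's t when the gamma parameters are r/2 and 2/r can be verified against R's dt. This is a sketch: the function name h_pdf, the choice r = 5, and the evaluation grid are illustrative assumptions.

```r
# Compound (normal with gamma-distributed precision) pdf in closed form.
# h_pdf is a hypothetical name used only for this sketch.
h_pdf <- function(x, alpha, beta) {
  gamma(alpha + 0.5) / (beta^alpha * sqrt(2 * pi) * gamma(alpha)) *
    (2 * beta / (2 + beta * x^2))^(alpha + 0.5)
}

r <- 5                        # assumed degrees of freedom
x <- seq(-6, 6, 0.01)

# With alpha = r/2 and beta = 2/r the compound pdf should match dt(x, r).
max_abs_diff <- max(abs(h_pdf(x, r / 2, 2 / r) - dt(x, df = r)))
print(max_abs_diff)
```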

The next two examples offer two additional illustrations of this type of com- 
pounding. 


Example 3.7.3. Suppose that we have a binomial distribution, but we are not
certain about the probability p of success on a given trial. Suppose p has been
selected first by some random process that has a beta pdf with parameters $\alpha$ and
$\beta$. Thus X, the number of successes on n independent trials, has a conditional
binomial distribution so that the joint pdf of X and p is

$$p(x|p)g(p) = \frac{n!}{x!(n-x)!}\, p^x (1-p)^{n-x}\, \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha-1}(1-p)^{\beta-1},$$

for x = 0, 1, ..., n, 0 < p < 1. Therefore, the unconditional pmf of X is given by
the integral

$$h(x) = \int_0^1 \frac{n!\,\Gamma(\alpha+\beta)}{x!(n-x)!\,\Gamma(\alpha)\Gamma(\beta)}\, p^{x+\alpha-1}(1-p)^{n-x+\beta-1}\,dp$$

$$= \frac{n!\,\Gamma(\alpha+\beta)\,\Gamma(x+\alpha)\,\Gamma(n-x+\beta)}{x!(n-x)!\,\Gamma(\alpha)\Gamma(\beta)\,\Gamma(n+\alpha+\beta)}, \quad x = 0, 1, 2, \ldots, n.$$

Now suppose $\alpha$ and $\beta$ are positive integers; since $\Gamma(k) = (k-1)!$, this unconditional
(marginal or compound) pdf can be written

$$h(x) = \frac{n!\,(\alpha+\beta-1)!\,(x+\alpha-1)!\,(n-x+\beta-1)!}{x!\,(n-x)!\,(\alpha-1)!\,(\beta-1)!\,(n+\alpha+\beta-1)!}, \quad x = 0, 1, 2, \ldots, n.$$

Because the conditional mean E(X|p) = np, the unconditional mean is $n\alpha/(\alpha+\beta)$
since E(p) equals the mean $\alpha/(\alpha+\beta)$ of the beta distribution. ■
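
The compound pmf of this example can be evaluated in R to confirm that it sums to one and has mean nα/(α + β). The sketch below rewrites the gamma-function ratio through beta functions (an equivalent form, since B(a,b) = Γ(a)Γ(b)/Γ(a+b)); the name betabin_pmf and the values n = 10, α = 2, β = 3 are illustrative assumptions.

```r
# Beta-binomial compound pmf, written with beta functions on the log scale.
# betabin_pmf is a hypothetical name used only for this sketch.
betabin_pmf <- function(x, n, alpha, beta) {
  exp(lchoose(n, x) + lbeta(x + alpha, n - x + beta) - lbeta(alpha, beta))
}

n <- 10; alpha <- 2; beta <- 3     # assumed values for illustration
x <- 0:n
pmf <- betabin_pmf(x, n, alpha, beta)

# Factorial form, valid because alpha and beta are positive integers here.
pmf_fact <- factorial(n) * factorial(alpha + beta - 1) *
  factorial(x + alpha - 1) * factorial(n - x + beta - 1) /
  (factorial(x) * factorial(n - x) * factorial(alpha - 1) *
     factorial(beta - 1) * factorial(n + alpha + beta - 1))

total    <- sum(pmf)               # should be 1
mean_pmf <- sum(x * pmf)           # should be n * alpha / (alpha + beta) = 4
print(c(total, mean_pmf, max(abs(pmf - pmf_fact))))
```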
Example 3.7.4. In this example, we develop by compounding a heavy-tailed
skewed distribution. Assume X has a conditional gamma pdf with parameters
k and $\theta^{-1}$. The weighting function for $\theta$ is a gamma pdf with parameters $\alpha$ and $\beta$.
Thus the unconditional (marginal or compounded) pdf of X is

$$h(x) = \int_0^{\infty} \left[\frac{\theta^{\alpha-1}e^{-\theta/\beta}}{\beta^{\alpha}\Gamma(\alpha)}\right] \left[\frac{\theta^k x^{k-1} e^{-\theta x}}{\Gamma(k)}\right] d\theta$$

$$= \int_0^{\infty} \frac{x^{k-1}\theta^{\alpha+k-1}}{\beta^{\alpha}\Gamma(\alpha)\Gamma(k)}\, e^{-\theta(1+\beta x)/\beta}\,d\theta.$$

Comparing this integrand to the gamma pdf with parameters $\alpha + k$ and $\beta/(1+\beta x)$,
we see that

$$h(x) = \frac{\Gamma(\alpha+k)\,\beta^k x^{k-1}}{\Gamma(\alpha)\Gamma(k)(1+\beta x)^{\alpha+k}}, \quad 0 < x < \infty,$$

which is the pdf of the generalized Pareto distribution (and a generalization of
the F distribution). Of course, when k = 1 (so that X has a conditional exponential
distribution), the pdf is

$$h(x) = \alpha\beta(1+\beta x)^{-(\alpha+1)}, \quad 0 < x < \infty,$$

which is the Pareto pdf. Both of these compound pdfs have thicker tails than the
original (conditional) gamma distribution.

While the cdf of the generalized Pareto distribution cannot be expressed in a
simple closed form, that of the Pareto distribution is

$$H(x) = \int_0^x \alpha\beta(1+\beta t)^{-(\alpha+1)}\,dt = 1 - (1+\beta x)^{-\alpha}, \quad 0 \leq x < \infty.$$

From this, we can create another useful long-tailed distribution by letting $X = Y^{\tau}$,
$0 < \tau$. Thus Y has the cdf

$$G(y) = P(Y \leq y) = P[X^{1/\tau} \leq y] = P[X \leq y^{\tau}].$$

Hence, this probability is equal to

$$G(y) = H(y^{\tau}) = 1 - (1+\beta y^{\tau})^{-\alpha}, \quad 0 \leq y < \infty,$$

with corresponding pdf

$$G'(y) = g(y) = \frac{\alpha\beta\tau y^{\tau-1}}{(1+\beta y^{\tau})^{\alpha+1}}, \quad 0 < y < \infty.$$

We call the associated distribution the transformed Pareto distribution or the
Burr distribution (Burr, 1942), and it has proved to be a useful one in modeling
thicker-tailed distributions.
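
The Pareto and Burr formulas of this example translate directly into R. The sketch below is illustrative only: the function names and the parameter values α = 3, β = 0.5, τ = 2 are assumptions, and each cdf is compared with a numerical integral of the corresponding pdf.

```r
# Pareto and Burr pdfs and cdfs; all function names are hypothetical.
pareto_pdf <- function(x, alpha, beta) alpha * beta * (1 + beta * x)^(-(alpha + 1))
pareto_cdf <- function(x, alpha, beta) 1 - (1 + beta * x)^(-alpha)
burr_cdf   <- function(y, alpha, beta, tau) pareto_cdf(y^tau, alpha, beta)
burr_pdf   <- function(y, alpha, beta, tau) {
  alpha * beta * tau * y^(tau - 1) / (1 + beta * y^tau)^(alpha + 1)
}

alpha <- 3; beta <- 0.5; tau <- 2   # assumed values for illustration

# Integrate each pdf from 0 and compare with the closed-form cdf.
pareto_num <- integrate(pareto_pdf, 0, 2, alpha = alpha, beta = beta)$value
burr_num   <- integrate(burr_pdf, 0, 1.5,
                        alpha = alpha, beta = beta, tau = tau)$value

print(c(pareto_num, pareto_cdf(2, alpha, beta)))
print(c(burr_num, burr_cdf(1.5, alpha, beta, tau)))
```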
EXERCISES


3.7.1. Suppose Y has a $\Gamma(\alpha, \beta)$ distribution. Let $X = e^Y$. Show that the pdf of
X is given by expression (3.7.5). Determine the cdf of X in terms of the cdf of a
$\Gamma$-distribution. Derive the mean and variance of X.


3.7.2. Write R functions for the pdf and cdf of the random variable in Exercise 
3.7.1. 


3.7.3. In Example 3.7.1, derive the pdf of the mixture distribution given in expres- 
sion (3.7.6), then obtain its mean and variance as given in expressions (3.7.7) and 
(3.7.8). 


3.7.4. Using the R function for the pdf in Exercise 3.7.2 and dgamma, write an R
function for the mixture pdf (3.7.6). For $\alpha = \beta = 2$, obtain a page of plots of this
density for p = 0.05, 0.10, 0.15 and 0.20.


3.7.5. Consider the mixture distribution (9/10) N(0,1) + (1/10)N(0, 9). Show that 
its kurtosis is 8.34. 


3.7.6. Let X have the conditional geometric pmf $\theta(1-\theta)^{x-1}$, x = 1, 2, ..., where $\theta$
is a value of a random variable having a beta pdf with parameters $\alpha$ and $\beta$. Show
that the marginal (unconditional) pmf of X is

$$\frac{\Gamma(\alpha+\beta)\,\Gamma(\alpha+1)\,\Gamma(\beta+x-1)}{\Gamma(\alpha)\,\Gamma(\beta)\,\Gamma(\alpha+\beta+x)}, \quad x = 1, 2, \ldots.$$

If $\alpha = 1$, we obtain

$$\frac{\beta}{(\beta+x)(\beta+x-1)}, \quad x = 1, 2, \ldots,$$

which is one form of Zipf’s law.


3.7.7. Repeat Exercise 3.7.6, letting X have a conditional negative binomial distribution
instead of the geometric one.


3.7.8. Let X have a generalized Pareto distribution with parameters k, $\alpha$, and $\beta$.
Show, by change of variables, that $Y = \beta X/(1 + \beta X)$ has a beta distribution.


3.7.9. Show that the failure rate (hazard function) of the Pareto distribution is

$$\frac{h(x)}{1 - H(x)} = \frac{\alpha}{\beta^{-1} + x}.$$

Find the failure rate (hazard function) of the Burr distribution with cdf

$$G(y) = 1 - \left(\frac{1}{1 + \beta y^{\tau}}\right)^{\alpha}, \quad 0 \leq y < \infty.$$

In each of these two failure rates, note what happens as the value of the variable
increases.
3.7.10. For the Burr distribution, show that

$$E(X^k) = \frac{1}{\beta^{k/\tau}}\, \Gamma\left(\alpha - \frac{k}{\tau}\right) \Gamma\left(\frac{k}{\tau} + 1\right) \Big/ \Gamma(\alpha),$$

provided $k < \alpha\tau$.


3.7.11. Let the number X of accidents have a Poisson distribution with mean $\lambda\theta$.
Suppose $\lambda$, the liability to have an accident, has, given $\theta$, a gamma pdf with parameters
$\alpha = h$ and $\beta = h^{-1}$; and $\theta$, an accident proneness factor, has a generalized
Pareto pdf with parameters $\alpha$, $\lambda = h$, and k. Show that the unconditional pdf of
X is

$$\frac{\Gamma(\alpha+k)\,\Gamma(\alpha+h)\,\Gamma(\alpha+h+k)\,\Gamma(h+k)\,\Gamma(k+x)}{\Gamma(\alpha)\,\Gamma(\alpha+k+h)\,\Gamma(h)\,\Gamma(k)\,\Gamma(\alpha+h+k+x)\,x!}, \quad x = 0, 1, 2, \ldots,$$

sometimes called the generalized Waring pmf.


3.7.12. Let X have a conditional Burr distribution with fixed parameters $\beta$ and $\tau$,
given parameter $\alpha$.


(a) If $\alpha$ has the geometric pmf $p(1-p)^{\alpha}$, $\alpha = 0, 1, 2, \ldots$, show that the unconditional
distribution of X is a Burr distribution.


(b) If $\alpha$ has the exponential pdf $\beta^{-1}e^{-\alpha/\beta}$, $\alpha > 0$, find the unconditional pdf of
X.


3.7.13. Let X have the conditional Weibull pdf

$$f(x|\theta) = \theta\tau x^{\tau-1} e^{-\theta x^{\tau}}, \quad 0 < x < \infty,$$

and let the pdf (weighting function) $g(\theta)$ be gamma with parameters $\alpha$ and $\beta$. Show
that the compound (marginal) pdf of X is that of Burr.


3.7.14. If X has a Pareto distribution with parameters $\alpha$ and $\beta$ and if c is a positive
constant, show that Y = cX has a Pareto distribution with parameters $\alpha$ and $\beta/c$.


Chapter 4 


Some Elementary Statistical 
Inferences 


4.1 Sampling and Statistics 


In Chapter 2, we introduced the concepts of samples and statistics. We continue 
with this development in this chapter while introducing the main tools of inference: 
confidence intervals and tests of hypotheses. 

In a typical statistical problem, we have a random variable X of interest, but its
pdf f(x) or pmf p(x) is not known. Our ignorance about f(x) or p(x) can roughly
be classified in one of two ways:


1. f(x) or p(x) is completely unknown.


2. The form of f(x) or p(x) is known down to a parameter $\theta$, where $\theta$ may be a
vector.


For now, we consider the second classification, although some of our discussion
pertains to the first classification also. Some examples are the following:


(a) X has an exponential distribution, Exp($\theta$), (3.3.6), where $\theta$ is unknown.


(b) X has a binomial distribution b(n, p), (3.1.2), where n is known but p is
unknown.


(c) X has a gamma distribution $\Gamma(\alpha, \beta)$, (3.3.2), where both $\alpha$ and $\beta$ are unknown.


(d) X has a normal distribution $N(\mu, \sigma^2)$, (3.4.6), where both the mean $\mu$ and
the variance $\sigma^2$ of X are unknown.


We often denote this problem by saying that the random variable X has a density
or mass function of the form $f(x; \theta)$ or $p(x; \theta)$, where $\theta \in \Omega$ for a specified set $\Omega$. For
example, in (a) above, $\Omega = \{\theta \mid \theta > 0\}$. We call $\theta$ a parameter of the distribution.
Because $\theta$ is unknown, we want to estimate it.
In this process, our information about the unknown distribution of X or the
unknown parameters of the distribution of X comes from a sample on X. The
sample observations have the same distribution as X, and we denote them as the
random variables $X_1, X_2, \ldots, X_n$, where n denotes the sample size. When the
sample is actually drawn, we use lower case letters $x_1, x_2, \ldots, x_n$ as the values
or realizations of the sample. Often we assume that the sample observations
$X_1, X_2, \ldots, X_n$ are also mutually independent, in which case we call the sample a
random sample, which we now formally define:


Definition 4.1.1. If the random variables $X_1, X_2, \ldots, X_n$ are independent and
identically distributed (iid), then these random variables constitute a random sample
of size n from the common distribution.


Often, functions of the sample are used to summarize the information in a sample.
These are called statistics, which we define as:


Definition 4.1.2. Let $X_1, X_2, \ldots, X_n$ denote a sample on a random variable X. Let
$T = T(X_1, X_2, \ldots, X_n)$ be a function of the sample. Then T is called a statistic.


Once the sample is drawn, then t is called the realization of T, where
$t = T(x_1, x_2, \ldots, x_n)$ and $x_1, x_2, \ldots, x_n$ is the realization of the sample.


4.1.1 Point Estimators 


Using the above terminology, the problem we discuss in this chapter is phrased as:
Let $X_1, X_2, \ldots, X_n$ denote a random sample on a random variable X with a density
or mass function of the form $f(x; \theta)$ or $p(x; \theta)$, where $\theta \in \Omega$ for a specified set $\Omega$. In
this situation, it makes sense to consider a statistic T, which is an estimator of $\theta$.
More formally, T is called a point estimator of $\theta$. While we call T an estimator
of $\theta$, we call its realization t an estimate of $\theta$.

There are several properties of point estimators that we discuss in this book.
We begin with a simple one, unbiasedness.


Definition 4.1.3 (Unbiasedness). Let $X_1, X_2, \ldots, X_n$ denote a sample on a random
variable X with pdf $f(x; \theta)$, $\theta \in \Omega$. Let $T = T(X_1, X_2, \ldots, X_n)$ be a statistic. We
say that T is an unbiased estimator of $\theta$ if $E(T) = \theta$.


In Chapters 6 and 7, we discuss several theories of estimation in general. The
purpose of this chapter, though, is an introduction to inference, so we briefly discuss
the maximum likelihood estimator (mle) and then use it to obtain point estimators
for some of the examples cited above. We expand on this theory in Chapter
6. Our discussion is for the continuous case. For the discrete case, simply replace
the pdf with the pmf.

In our problem, the information in the sample and the parameter $\theta$ are involved
in the joint distribution of the random sample; i.e., $\prod_{i=1}^n f(x_i; \theta)$. We want to view
this as a function of $\theta$, so we write it as

$$L(\theta) = L(\theta; x_1, x_2, \ldots, x_n) = \prod_{i=1}^n f(x_i; \theta). \tag{4.1.1}$$

This is called the likelihood function of the random sample. As an estimate of
$\theta$, a measure of the center of $L(\theta)$ seems appropriate. An often-used estimate is
the value of $\theta$ that provides a maximum of $L(\theta)$. If it is unique, this is called the
maximum likelihood estimator (mle), and we denote it as $\hat{\theta}$; i.e.,

$$\hat{\theta} = \text{Argmax}\, L(\theta). \tag{4.1.2}$$

In practice, it is often much easier to work with the log of the likelihood, that
is, the function $l(\theta) = \log L(\theta)$. Because the log is a strictly increasing function,
the value that maximizes $l(\theta)$ is the same as the value that maximizes $L(\theta)$. Furthermore,
for most of the models discussed in this book, the pdf (or pmf) is a
differentiable function of $\theta$, and frequently $\hat{\theta}$ solves the equation

$$\frac{\partial l(\theta)}{\partial \theta} = 0. \tag{4.1.3}$$

If $\theta$ is a vector of parameters, this results in a system of equations to be solved
simultaneously; see Example 4.1.3. These equations are often referred to as the mle
estimating equations, (EE).

As we show in Chapter 6, under general conditions, mles have some good properties.
One property that we need at the moment concerns the situation where,
besides the parameter $\theta$, we are also interested in the parameter $\eta = g(\theta)$ for a
specified function g. Then, as Theorem 6.1.2 of Chapter 6 shows, the mle of $\eta$ is
$\hat{\eta} = g(\hat{\theta})$, where $\hat{\theta}$ is the mle of $\theta$. We now proceed with some examples, including
data realizations.


Example 4.1.1 (Exponential Distribution). Suppose the common pdf of the random
sample $X_1, X_2, \ldots, X_n$ is the $\Gamma(1, \theta)$ density $f(x) = \theta^{-1}\exp\{-x/\theta\}$ with support
$0 < x < \infty$; see expression (3.3.2). This gamma distribution is often called the
exponential distribution. The log of the likelihood function is given by

$$l(\theta) = \log \prod_{i=1}^n \frac{1}{\theta}\, e^{-x_i/\theta} = -n\log\theta - \theta^{-1}\sum_{i=1}^n x_i.$$

The first partial of the log-likelihood with respect to $\theta$ is

$$\frac{\partial l(\theta)}{\partial \theta} = -n\theta^{-1} + \theta^{-2}\sum_{i=1}^n x_i.$$

Setting this partial to 0 and solving for $\theta$, we obtain the solution $\overline{x}$. There is only one
critical value and, furthermore, the second partial of the log-likelihood evaluated
at $\overline{x}$ is strictly negative, verifying that it provides a maximum. Hence, for this
example, the statistic $\hat{\theta} = \overline{X}$ is the mle of $\theta$. Because $E(X) = \theta$, we have that
$E(\overline{X}) = \theta$ and, hence, $\hat{\theta}$ is an unbiased estimator of $\theta$.

Rasmussen (1992), page 92, presents a data set where the variable of interest
X is the number of operating hours until the first failure of air-conditioning units
for Boeing 720 airplanes. A random sample of size n = 13 was obtained and its
realized values are:
359 413 25 130 90 50 50 487 102 194 55 74 97
For instance, 359 hours is the realization of the random variable $X_1$. The data
range from 25 to 487 hours. Assuming an exponential model, the point estimate of
$\theta$ discussed above is the arithmetic average of these data. Assuming that the data
set is stored in the R vector ophrs, this average is computed in R by
mean(ophrs); 163.5385
Hence our point estimate of $\theta$, the mean of X, is 163.54 hours. How close is 163.54
hours to the true $\theta$? We provide an answer to this question in the next section.
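
The closed-form mle of the exponential example can be cross-checked by maximizing the log-likelihood numerically. The following sketch uses the 13 operating-hours values listed above; the function name loglik, the search interval (1, 1000), and the tolerance are assumptions made only for illustration.

```r
# Operating hours data from the example, entered by hand.
ophrs <- c(359, 413, 25, 130, 90, 50, 50, 487, 102, 194, 55, 74, 97)

# Exponential log-likelihood l(theta) = -n log(theta) - sum(x)/theta.
# loglik is a hypothetical name used only for this sketch.
loglik <- function(theta, x) -length(x) * log(theta) - sum(x) / theta

# Numerical maximization over an assumed search interval.
fit <- optimize(loglik, interval = c(1, 1000), x = ophrs,
                maximum = TRUE, tol = 1e-8)

# The numerical maximizer should agree with the sample mean.
print(c(numeric_mle = fit$maximum, sample_mean = mean(ophrs)))
```

The same pattern (write the log-likelihood, then call a numerical optimizer) applies when no closed-form solution of the estimating equation is available.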


Example 4.1.2 (Binomial Distribution). Let X be one or zero if, respectively, the
outcome of a Bernoulli experiment is success or failure. Let $\theta$, $0 < \theta < 1$, denote
the probability of success. Then by (3.1.1), the pmf of X is

$$p(x; \theta) = \theta^x (1-\theta)^{1-x}, \quad x = 0 \text{ or } 1.$$

If $X_1, X_2, \ldots, X_n$ is a random sample on X, then the likelihood function is

$$L(\theta) = \prod_{i=1}^n p(x_i; \theta) = \theta^{\sum_{i=1}^n x_i}(1-\theta)^{n - \sum_{i=1}^n x_i}, \quad x_i = 0 \text{ or } 1.$$

Taking logs, we have

$$l(\theta) = \sum_{i=1}^n x_i \log\theta + \left(n - \sum_{i=1}^n x_i\right)\log(1-\theta), \quad x_i = 0 \text{ or } 1.$$

The partial derivative of $l(\theta)$ is

$$\frac{\partial l(\theta)}{\partial \theta} = \frac{\sum_{i=1}^n x_i}{\theta} - \frac{n - \sum_{i=1}^n x_i}{1-\theta}.$$

Setting this to 0 and solving for $\theta$, we obtain $\hat{\theta} = n^{-1}\sum_{i=1}^n X_i = \overline{X}$; i.e., the mle
is the proportion of successes in the n trials. Because $E(X) = \theta$, $\hat{\theta}$ is an unbiased
estimator of $\theta$.

Devore (2012) discusses a study involving ceramic hip replacements which for
some patients can be squeaky; see, also, page 30 of Kloke and McKean (2014).
In this study, 28 out of 143 hip replacements squeaked. In terms of the above
discussion, we have a realization of a sample of size n = 143 from a binomial
distribution where success is a hip replacement that squeaks and failure is one that
does not squeak. Let $\theta$ denote the probability of success. Then our estimate of $\theta$
based on this sample is $\hat{\theta} = 28/143 = 0.1958$. This is straightforward to calculate
but, for later use, the R code prop.test(28,143) calculates this proportion.
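
The estimate 28/143 can be reproduced in three ways: directly, from the estimate component returned by prop.test, and by maximizing the Bernoulli log-likelihood numerically. This is a sketch; the object names and the search interval are illustrative assumptions.

```r
successes <- 28; n <- 143

# Direct calculation of the sample proportion.
theta_hat <- successes / n

# prop.test returns a list; its `estimate` component holds the proportion.
fit <- prop.test(successes, n)
prop_estimate <- unname(fit$estimate)

# Numerical maximization of the log-likelihood of the example.
loglik <- function(theta) successes * log(theta) + (n - successes) * log(1 - theta)
num_mle <- optimize(loglik, interval = c(0.001, 0.999),
                    maximum = TRUE, tol = 1e-10)$maximum

print(c(theta_hat, prop_estimate, num_mle))
```

Besides the point estimate, the object returned by prop.test also carries a test statistic and a confidence interval.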


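Example 4.1.3 (Normal Distribution). Let X have a $N(\mu, \sigma^2)$ distribution with
the pdf given in expression (3.4.6). In this case, $\boldsymbol{\theta}$ is the vector $\boldsymbol{\theta} = (\mu, \sigma)$. If
$X_1, X_2, \ldots, X_n$ is a random sample on X, then the log of the likelihood function
simplifies to

$$l(\mu, \sigma) = -\frac{n}{2}\log 2\pi - n\log\sigma - \frac{1}{2}\sum_{i=1}^n \left(\frac{x_i - \mu}{\sigma}\right)^2. \tag{4.1.4}$$

The two partial derivatives simplify to

$$\frac{\partial l(\mu, \sigma)}{\partial \mu} = -\sum_{i=1}^n \left(\frac{x_i - \mu}{\sigma}\right)\left(-\frac{1}{\sigma}\right) \tag{4.1.5}$$

$$\frac{\partial l(\mu, \sigma)}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^n (x_i - \mu)^2. \tag{4.1.6}$$

Setting these to 0 and solving simultaneously, we see that the mles are

$$\hat{\mu} = \overline{X} \tag{4.1.7}$$

$$\hat{\sigma}^2 = n^{-1}\sum_{i=1}^n (X_i - \overline{X})^2. \tag{4.1.8}$$

Notice that we have used the property that the mle of $\sigma^2$ is the mle of $\sigma$ squared. As
we have shown in Chapter 2, (2.8.6), the estimator $\overline{X}$ is an unbiased estimator for
$\mu$. Further, from Example 2.8.7 of Section 2.8 we know that the following statistic

$$S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \overline{X})^2 \tag{4.1.9}$$

is an unbiased estimator of $\sigma^2$. Because $\hat{\sigma}^2 = [(n-1)/n]S^2$, for the mle of $\sigma^2$ we have
$E(\hat{\sigma}^2) = [(n-1)/n]\sigma^2$. Hence, the mle is a biased estimator of $\sigma^2$. Note, though,
that the bias of $\hat{\sigma}^2$ is $E(\hat{\sigma}^2 - \sigma^2) = -\sigma^2/n$, which converges to 0 as $n \to \infty$. In
practice, however, $S^2$ is the preferred estimator of $\sigma^2$.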

Rasmussen (1991), page 65, discusses a study to measure the concentration of 
sulfur dioxide in a damaged Bavarian forest. The following data set is the realization 
of a random sample of size n = 24 measurements (micro grams per cubic meter) of 
this sulfur dioxide concentration: 

33.4 38.6 41.7 43.9 44.4 45.3 46.1 47.6 50.0 52.4 52.7 53.9 

54.3 55.1 56.4 56.5 60.7 61.8 62.2 63.4 65.5 66.6 70.0 71.5. 
These data are also in the R data file sulfurdio.rda at the site listed in the Preface.
Assuming these data are in the R vector sulfurdioxide, the following R segment
obtains the estimates of the true mean and variance (both $s^2$ and $\hat\sigma^2$ are computed):

mean(sulfurdioxide); var(sulfurdioxide); (23/24)*var(sulfurdioxide)

53.91667 101.4797 97.25139

Hence, we estimate the true mean concentration of sulfur dioxide in this damaged
Bavarian forest to be 53.92 micro grams per cubic meter. The realization of the
statistic $S^2$ is $s^2 = 101.48$, while the biased estimate of $\sigma^2$ is 97.25. Rasmussen notes
that the average concentration of sulfur dioxide in undamaged areas of Bavaria is
20 micro grams per cubic meter. This value appears to be quite distant from the
sample values. This will be discussed statistically in later sections. ■
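
For readers without the data file at hand, the following self-contained sketch enters the 24 measurements listed in this example and checks that the mle of the variance, computed directly from (4.1.8), equals $(n-1)/n$ times the sample variance. The variable names xbar, s2, and sigma2_mle are illustrative and are not used in the text.

```r
# The 24 sulfur dioxide measurements listed in Example 4.1.3
sulfurdioxide <- c(33.4, 38.6, 41.7, 43.9, 44.4, 45.3, 46.1, 47.6, 50.0, 52.4,
                   52.7, 53.9, 54.3, 55.1, 56.4, 56.5, 60.7, 61.8, 62.2, 63.4,
                   65.5, 66.6, 70.0, 71.5)
n <- length(sulfurdioxide)

xbar <- mean(sulfurdioxide)                    # mle of mu, expression (4.1.7)
s2 <- var(sulfurdioxide)                       # unbiased estimator S^2, expression (4.1.9)
sigma2_mle <- mean((sulfurdioxide - xbar)^2)   # mle of sigma^2, expression (4.1.8)

# The mle equals ((n - 1) / n) * S^2, so it is slightly smaller than S^2
c(xbar = xbar, s2 = s2, sigma2_mle = sigma2_mle, check = (n - 1) / n * s2)
```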


In all three of these examples, standard differential calculus methods led us to
the solution. For the next example, the support of the random variable involves $\theta$
and, hence, it is not surprising that for this case differential calculus is not useful.

Example 4.1.4 (Uniform Distribution). Let $X_1, \ldots, X_n$ be iid with the uniform
$(0,\theta)$ density; i.e., $f(x) = 1/\theta$ for $0 < x \le \theta$, 0 elsewhere. Because $\theta$ is in the
support, differentiation is not helpful here. The likelihood function can be written
as

$$L(\theta) = \theta^{-n} I(\max\{x_i\}, \theta),$$

where $I(a,b)$ is 1 or 0 if $a \le b$ or $a > b$, respectively. The function $L(\theta)$ is a
decreasing function of $\theta$ for all $\theta \ge \max\{x_i\}$ and is 0 otherwise [sketch the graph
of $L(\theta)$]. So the maximum occurs at the smallest value that $\theta$ can assume; i.e., the
mle is $\hat\theta = \max\{X_i\}$. ■
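
The shape of this likelihood can be explored numerically. The sketch below is only an illustration under stated assumptions: it simulates a hypothetical sample with runif (the true value 5, the sample size 10, and the grid are arbitrary), evaluates $L(\theta)$ on a grid that contains the sample maximum, and confirms that the grid maximizer is the sample maximum.

```r
set.seed(414)                      # arbitrary seed for reproducibility
theta_true <- 5                    # hypothetical true parameter
x <- runif(10, min = 0, max = theta_true)

# L(theta) = theta^(-n) if theta >= max(x), and 0 otherwise
lik <- function(theta, x) ifelse(theta >= max(x), theta^(-length(x)), 0)

# Grid of candidate values; the sample maximum is added so the grid contains it
grid <- sort(c(seq(0.1, 10, by = 0.01), max(x)))
theta_grid <- grid[which.max(lik(grid, x))]

theta_hat <- max(x)                # closed-form mle
c(grid_maximizer = theta_grid, mle = theta_hat)

# plot(grid, lik(grid, x), type = "l") sketches the graph of L(theta)
```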


4.1.2 Histogram Estimates of pmfs and pdfs 


Let $X_1, \ldots, X_n$ be a random sample on a random variable X with cdf F(x). In this
section, we briefly discuss a histogram of the sample, which is an estimate of the pmf, 
p(x), or the pdf, f(x), of X depending on whether X is discrete or continuous. Other 
than X being a discrete or continuous random variable, we make no assumptions 
on the form of the distribution of X. In particular, we do not assume a parametric 
form of the distribution as we did for the above discussion on maximum likelihood 
estimates; hence, the histogram that we present is often called a nonparametric 
estimator. See Chapter 10 for a general discussion of nonparametric inference. We 
discuss the discrete situation first. 


The Distribution of X Is Discrete 


Assume that X is a discrete random variable with pmf p(x). Let $X_1, \ldots, X_n$ be
a random sample on X. First, suppose that the space of X is finite, say,
$\mathcal{D} = \{a_1, \ldots, a_m\}$. An intuitive estimate of $p(a_j)$ is the relative frequency of $a_j$ in the
sample. We express this more formally as follows. For $j = 1, 2, \ldots, m$, define the
statistics

$$I_j(X_i) = \begin{cases} 1 & X_i = a_j \\ 0 & X_i \ne a_j. \end{cases}$$

Then our intuitive estimate of $p(a_j)$ can be expressed by the sample average

$$\hat{p}(a_j) = \frac{1}{n}\sum_{i=1}^{n} I_j(X_i). \tag{4.1.10}$$

These estimators $\{\hat{p}(a_1), \ldots, \hat{p}(a_m)\}$ constitute the nonparametric estimate of the
pmf p(x). Note that $I_j(X_i)$ has a Bernoulli distribution with probability of success
$p(a_j)$. Because

$$E[\hat{p}(a_j)] = \frac{1}{n}\sum_{i=1}^{n} E[I_j(X_i)] = \frac{1}{n}\sum_{i=1}^{n} p(a_j) = p(a_j), \tag{4.1.11}$$

$\hat{p}(a_j)$ is an unbiased estimator of $p(a_j)$.

Next, suppose that the space of X is infinite, say, $\mathcal{D} = \{a_1, a_2, \ldots\}$. In practice,
we select a value, say, $a_m$, and make the groupings

$$\{a_1\}, \{a_2\}, \ldots, \{a_m\}, \tilde{a}_{m+1} = \{a_{m+1}, a_{m+2}, \ldots\}. \tag{4.1.12}$$

Let $\hat{p}(\tilde{a}_{m+1})$ be the proportion of sample items that are greater than or equal
to $a_{m+1}$. Then the estimates $\{\hat{p}(a_1), \ldots, \hat{p}(a_m), \hat{p}(\tilde{a}_{m+1})\}$ form our estimate of
p(x). For the merging of groups, a rule of thumb is to select m so that the frequency
of the category $a_m$ exceeds twice the combined frequencies of the categories
$a_{m+1}, a_{m+2}, \ldots$.
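
The relative frequency estimator (4.1.10) and the grouping (4.1.12) can be sketched in a few lines of R. This is a minimal illustration under stated assumptions: the sample is a hypothetical set of 30 simulated Poisson counts (not data from the text), the space is $\{0, 1, 2, \ldots\}$, and the cutoff m = 5 is chosen by hand rather than by the rule of thumb.

```r
set.seed(4112)                     # arbitrary seed for reproducibility
x <- rpois(30, lambda = 2)         # hypothetical sample; not data from the text

m <- 5                             # keep the values 0, 1, ..., m as separate classes
xg <- pmin(x, m + 1)               # merge m + 1, m + 2, ... into one final class

# Relative frequencies: each entry is the sample average of the indicators I_j(X_i)
phat <- tabulate(xg + 1, nbins = m + 2) / length(x)
names(phat) <- c(0:m, paste0(">=", m + 1))
phat
```

Because every sample item falls in exactly one class, the estimates sum to 1, as an estimate of a pmf should.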

A histogram is a barplot of $\hat{p}(a_j)$ versus $a_j$. There are two cases to consider. For
the first case, suppose the values $a_j$ represent qualitative categories, for example,
hair colors of a population of people. In this case, there is no ordinal information
in the $a_j$s. The usual histogram for such data consists of nonabutting bars with
heights $\hat{p}(a_j)$ that are plotted in decreasing order of the $\hat{p}(a_j)$s. Such histograms
are usually called bar charts. An example is helpful here.


Example 4.1.5 (Hair Color of Scottish School Children). Kendall and Stuart
(1979) present data on the eye and hair color of Scottish schoolchildren in the
early 1900s. The data are also in the file scotteyehair.rda at the site listed in the
Preface. In this example, we consider hair color. The discrete random variable is
the hair color of a Scottish child with categories fair, red, medium, dark, and black.
The results that Kendall and Stuart present are based on a sample of n = 22,361
Scottish school children. The frequency distribution of this sample and the estimate
of the pmf are


Fair    Red    Medium    Dark    Black


The bar chart of this sample is shown in Figure 4.1.1. Assume that the counts
(second row of the table) are in the R vector vec. Then the following R segment
computes this bar chart:

n=sum(vec); vecs = sort(vec,decreasing=T)/n

nms = c("Medium","Fair","Dark","Red","Black")

barplot(vecs,beside=TRUE,names.arg=nms,ylab="",xlab="Haircolor")


For the second case, assume that the values in the space $\mathcal{D}$ are ordinal in nature;
i.e., the natural ordering of the $a_j$s is numerically meaningful. In this case, the usual
histogram is an abutting bar chart with heights $\hat{p}(a_j)$ that are plotted in the natural
order of the $a_j$s, as in the following example.


Figure 4.1.1: Bar chart of the Scottish hair color data discussed in Example 4.1.5.


Example 4.1.6 (Simulated Poisson Variates). The following 30 data points are
simulated values drawn from a Poisson distribution with mean $\lambda = 2$; see Example
4.8.2 for the generation of Poisson variates.


The nonparametric estimate of the pmf is



The histogram for this data set is given in Figure 4.1.2. Note that counts are used
for the vertical axis. If the R vector x contains the 30 data points, then the following
R code computes this histogram:

brs=seq(-.5,6.5,1); hist(x,breaks=brs,xlab="Number of events",ylab="")
■


Figure 4.1.2: Histogram of the Poisson variates of Example 4.1.6.


The Distribution of X Is Continuous


For this section, assume that the random sample $X_1, \ldots, X_n$ is from a continuous
random variable X with continuous pdf f(x). We first sketch an estimate for this
pdf at a specified value of x. Then we use this estimate to develop a histogram
estimate of the pdf. For an arbitrary but fixed point x and a given h > 0, consider
the interval $(x-h, x+h)$. By the mean value theorem for integrals, we have for
some $\xi$, $|x - \xi| < h$, that

$$P(x-h < X < x+h) = \int_{x-h}^{x+h} f(t)\,dt = f(\xi)2h \approx f(x)2h.$$

The nonparametric estimate of the left side is the proportion of the sample items
that fall in the interval $(x-h, x+h)$. This suggests the following nonparametric
estimate of f(x) at a given x:

$$\hat{f}(x) = \frac{\#\{x-h < X_i < x+h\}}{2hn}. \tag{4.1.13}$$

To write this more formally, consider the indicator statistic

$$I_i(x) = \begin{cases} 1 & x-h < X_i < x+h \\ 0 & \text{otherwise}, \end{cases} \quad i = 1, \ldots, n.$$

Then a nonparametric estimator of f(x) is

$$\hat{f}(x) = \frac{1}{2hn}\sum_{i=1}^{n} I_i(x). \tag{4.1.14}$$

Since the sample items are identically distributed,

$$E[\hat{f}(x)] = \frac{1}{2hn}\,n f(\xi)2h = f(\xi) \to f(x),$$

as $h \to 0$. Hence $\hat{f}(x)$ is approximately an unbiased estimator of the density f(x).
In density estimation terminology, the indicator function $I_i$ is called a rectangular
kernel with bandwidth 2h. See Sheather and Jones (1991) and Chapter 6 of
Lehmann (1999) for discussions of density estimation. The R function density
provides a density estimator with several options. For the examples in the text, we
use the default option as in Example 4.1.7.

The histogram provides a somewhat crude but often used estimator of the pdf,
so a few remarks on it are pertinent. Let $x_1, \ldots, x_n$ be the realized values of the
random sample on a continuous random variable X with pdf f(x). Our histogram
estimate of f(x) is obtained as follows. While for the discrete case, there are natural
classes for the histogram, for the continuous case these classes must be chosen. One
way of doing this is to select a positive integer m, an h > 0, and a value a such that
$a < \min x_i$, so that the m intervals

$$(a-h, a+h],\ (a+h, a+3h],\ (a+3h, a+5h],\ \ldots,\ (a+(2m-3)h, a+(2m-1)h] \tag{4.1.15}$$

cover the range of the sample $[\min x_i, \max x_i]$. These intervals form our classes. Let
$A_j = (a+(2j-3)h, a+(2j-1)h]$ for $j = 1, \ldots, m$.

Let $\hat{f}_h(x)$ denote our histogram estimate. If $x \le a-h$ or $x > a+(2m-1)h$
then define $\hat{f}_h(x) = 0$. For $a-h < x \le a+(2m-1)h$, x is in one, and only one,
$A_j$. For $x \in A_j$, define $\hat{f}_h(x)$ to be:

$$\hat{f}_h(x) = \frac{\#\{x_i \in A_j\}}{2hn}. \tag{4.1.16}$$

Note that $\hat{f}_h(x) \ge 0$ and that

$$\int_{-\infty}^{\infty} \hat{f}_h(x)\,dx = \int_{a-h}^{a+(2m-1)h} \hat{f}_h(x)\,dx = \sum_{j=1}^{m}\int_{A_j} \frac{\#\{x_i \in A_j\}}{2hn}\,dx$$

$$= \frac{1}{2hn}\sum_{j=1}^{m} \#\{x_i \in A_j\}\,[h(2j-1-2j+3)] = \frac{2h}{2hn}\,n = 1;$$

so, $\hat{f}_h(x)$ satisfies the properties of a pdf.

For the discrete case, except when classes are merged, the histogram is unique.
For the continuous case, though, the histogram depends on the classes chosen. The
resulting picture can be quite different if the classes are changed. Unless there is
a compelling reason for the class selection, we recommend using the default classes
selected by the computational algorithm. The histogram algorithms in most
statistical packages such as R are current on recent research for selection of classes. The
histogram in the following example is based on default classes.
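
The two estimators of this subsection, the rectangular kernel estimator (4.1.14) and the histogram estimator (4.1.16), can be written directly from their definitions. The sketch below is an illustration only: the function names fhat_rect and fhat_hist, the five-point sample, and the choices a = 1, h = 0.5, m = 4 are all hypothetical, and in practice the built-in functions density and hist would normally be used instead.

```r
# Rectangular kernel estimate (4.1.14): proportion of the sample in
# (x0 - h, x0 + h), divided by the interval length 2h
fhat_rect <- function(x0, x, h) {
  sapply(x0, function(u) mean(x > u - h & x < u + h) / (2 * h))
}

# Histogram estimate (4.1.16) with classes A_j = (a + (2j-3)h, a + (2j-1)h]
fhat_hist <- function(x0, x, a, h, m) {
  breaks <- a + (2 * (0:m) - 1) * h                    # a - h, a + h, ..., a + (2m-1)h
  counts <- as.vector(table(cut(x, breaks = breaks)))  # right-closed classes
  j <- findInterval(x0, breaks, left.open = TRUE)      # class index; 0 or m + 1 if outside
  out <- numeric(length(x0))
  inside <- j >= 1 & j <= m
  out[inside] <- counts[j[inside]] / (2 * h * length(x))
  out
}

x <- c(1.2, 2.5, 2.7, 3.9, 4.4)    # small hypothetical sample
a <- 1; h <- 0.5; m <- 4           # classes (0.5,1.5], (1.5,2.5], (2.5,3.5], (3.5,4.5]

fhat_rect(2.5, x, h)               # two of five points lie in (2, 3)
fhat_hist(c(0.2, 2.5, 4), x, a, h, m)

# Total area under the histogram estimate: one height per class, times the width 2h
mids <- a + 2 * (0:(m - 1)) * h
area <- sum(fhat_hist(mids, x, a, h, m) * 2 * h)
area
```

The last quantity reproduces the integral computed above: as long as the classes cover the sample, the area is 1.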


Example 4.1.7. In Example 4.1.3, we presented a data set involving sulfur dioxide 
concentrations in a damaged Bavarian forest. The histogram of this data set is 
found in Figure 4.1.3. There are only 24 data points in the sample which are far 
too few for density estimation. With this in mind, although the distribution of data 
is mound shaped, the center appears to be too flat for normality. We have overlaid 
the histogram with the default R density estimate (solid line) which confirms some 
caution on normality. Recall that the sample mean and standard deviation for these
data are 53.91667 and 10.07371, respectively. So we also plotted the normal pdf
with this mean and standard deviation (dashed line). The R code assumes that the
data are in the R vector sulfurdioxide.

hist(sulfurdioxide,xlab="Sulfurdioxide",ylab=" ",pr=T,ylim=c(0,.04))
lines(density(sulfurdioxide))
y=dnorm(sulfurdioxide,53.91667,10.07371); lines(y~sulfurdioxide,lty=2)

The normal density plot seems to be a poor fit.


Figure 4.1.3: Histogram of the sulfur dioxide concentrations in a damaged Bavarian
forest overlaid with a density estimate (solid line) and a normal pdf (dashed
line) with mean and variance replaced by the sample mean and sample variance,
respectively. Data are given in Example 4.1.3.


EXERCISES 


4.1.1. Twenty motors were put on test under a high-temperature setting. The
lifetimes in hours of the motors under these conditions are given below. Also, the
data are in the file lifetimemotor.rda at the site listed in the Preface. Suppose
we assume that the lifetime of a motor under these conditions, X, has a $\Gamma(1,\theta)$
distribution.


1 4 5 21 22 28 40 42 51 53
58 67 95 124 124 160 202 260 303 363


(a) Obtain a histogram of the data and overlay it with a density estimate, using
the code hist(x,pr=T); lines(density(x)) where the R vector x contains
the data. Based on this plot, do you think that the $\Gamma(1,\theta)$ model is credible?


(b) Assuming a $\Gamma(1,\theta)$ model, obtain the maximum likelihood estimate $\hat\theta$ of $\theta$ and
locate it on your histogram. Next overlay the pdf of a $\Gamma(1,\hat\theta)$ distribution on
the histogram. Use the R function dgamma(x,shape=1,scale=$\hat\theta$) to evaluate
the pdf.


(c) Obtain the sample median of the data, which is an estimate of the median 
lifetime of a motor. What parameter is it estimating (i.e., determine the 
median of X)? 


(d) Based on the mle, what is another estimate of the median of X? 


4.1.2. Here are the weights of 26 professional baseball pitchers [see page 76 of
Hettmansperger and McKean (2011) for the complete data set]. The data are in R
file bb.rda. Suppose we assume that the weight of a professional baseball pitcher
is normally distributed with mean $\mu$ and variance $\sigma^2$.


160 175 180 185 185 185 190 190 195 195 195 200 200 
200 200 205 205 210 210 218 219 220 222 225 225 232 


(a) Obtain a histogram of the data. Based on this plot, is a normal probability 
model credible? 


(b) Obtain the maximum likelihood estimates of $\mu$, $\sigma^2$, $\sigma$, and $\mu/\sigma$. Locate your
estimate of $\mu$ on your plot in part (a). Then overlay the normal pdf with these
estimates on your histogram in part (a).


(c) Using the binomial model, obtain the maximum likelihood estimate of the 
proportion p of professional baseball pitchers who weigh over 215 pounds. 


(d) Determine the mle of p assuming that the weight of a professional baseball
player follows the normal probability model $N(\mu,\sigma^2)$ with $\mu$ and $\sigma$ unknown.


4.1.3. Suppose the number of customers X that enter a store between the hours
9:00 a.m. and 10:00 a.m. follows a Poisson distribution with parameter $\theta$. Suppose
a random sample of the number of customers that enter the store between 9:00 a.m.
and 10:00 a.m. for 10 days results in the values


9 7 9 15 10 13 11 7 2 12


(a) Determine the maximum likelihood estimate of $\theta$. Show that it is an unbiased
estimator.


(b) Based on these data, obtain the realization of your estimator in part (a). 
Explain the meaning of this estimate in terms of the number of customers. 


4.1.4. For Example 4.1.3, verify equations (4.1.4)—(4.1.8). 
4.1.5. Let $X_1, X_2, \ldots, X_n$ be a random sample from a continuous-type distribution.


(a) Find $P(X_1 < X_2)$, $P(X_1 < X_2, X_1 < X_3)$, ..., $P(X_1 < X_i, i = 2, 3, \ldots, n)$.

(b) Suppose the sampling continues until $X_1$ is no longer the smallest observation
(i.e., $X_j < X_1 \le X_i$, $i = 2, 3, \ldots, j-1$). Let Y equal the number of trials, not
including $X_1$, until $X_1$ is no longer the smallest observation (i.e., $Y = j - 1$).
Show that the distribution of Y is

$$P(Y = y) = \frac{1}{y(y+1)}, \quad y = 1, 2, 3, \ldots.$$


(c) Compute the mean and variance of Y if they exist. 


4.1.6. Consider the estimator of the pmf in expression (4.1.10). In equation (4.1.11),
we showed that this estimator is unbiased. Find the variance of the estimator and
its mgf.


4.1.7. The data set on Scottish schoolchildren discussed in Example 4.1.5 included 
the eye colors of the children also. The frequencies of their eye colors are 


Blue    Light    Medium    Dark
2978    6697     7511      5175


Use these frequencies to obtain a bar chart and an estimate of the associated pmf. 


4.1.8. Recall that for the parameter $\eta = g(\theta)$, the mle of $\eta$ is $g(\hat\theta)$, where $\hat\theta$ is the
mle of $\theta$. Assuming that the data in Example 4.1.6 were drawn from a Poisson
distribution with mean $\lambda$, obtain the mle of $\lambda$ and then use it to obtain the mle of
the pmf. Compare the mle of the pmf to the nonparametric estimate. Note: For
the domain value 6, obtain the mle of $P(X \ge 6)$.


4.1.9. Consider the nonparametric estimator, (4.1.14), of a pdf. 
(a) Obtain its mean and determine the bias of the estimator. 
(b) Obtain the variance of the estimator. 


4.1.10. This data set was downloaded from the site http://lib.stat.cmu.edu/DASL/
at Carnegie Mellon University. The original source is Willerman et al. (1991). The
data consist of a sample of brain information recorded on 40 college students. The 
variables include gender, height, weight, three IQ measurements, and Magnetic 
Resonance Imaging (MRI) counts, as a determination of brain size. The data are in 
the rda file braindata.rda at the sites referenced in the Preface. For this exercise, 
consider the MRI counts. 


(a) Load the rda file braindata.rda and print the MRI data, using the code: 
mri <- braindata[,7]; print(mri). 


(b) Obtain a histogram of the data, hist(mri,pr=T). Comment on the shape.


(c) Overlay the default density estimator, lines(density(mri)). Comment on
the shape.

(d) Obtain the sample mean and standard deviation and on the histogram overlay
the normal pdf with these estimates as parameters, using mris=sort(mri) and
lines(dnorm(mris,mean(mris),sd(mris))~mris,lty=2). Comment on the
fit.


(e) Determine the proportions of the data within 1 and 2 standard deviations of 
the sample mean and compare these with the empirical rule. 


4.1.11. This is a famous data set on the speed of light recorded by the scientist
Simon Newcomb. The data set was obtained at the Carnegie Mellon site given in
Exercise 4.1.10 and it can also be found in the rda file speedlight.rda at the sites
referenced in the Preface. Stigler (1977) presents an informative discussion of this
data set.


(a) Load the rda file and type the command print(speed). As Stigler notes, the
data values $\times 10^{-3} + 24.8$ are Newcomb's data values; hence, negative items
can occur. Also, in the unit of the data the "true value" is 33.02. Discuss the
data.


(b) Obtain a histogram of the data. Comment on the shape. 


(c) On the histogram overlay the default density estimator. Comment on the 
shape. 


(d) Obtain the sample mean and standard deviation and on the histogram overlay 
the normal pdf with these estimates as parameters. Comment on the fit. 


(e) Determine the proportions of the data within 1 and 2 standard deviations of 
the sample mean and compare these with the empirical rule. 


4.2 Confidence Intervals 


Let us continue with the statistical problem that we were discussing in Section
4.1. Recall that the random variable of interest X has density $f(x;\theta)$, $\theta \in \Omega$,
where $\theta$ is unknown. In that section, we discussed estimating $\theta$ by a statistic
$\hat\theta = \hat\theta(X_1, \ldots, X_n)$, where $X_1, \ldots, X_n$ is a sample from the distribution of X. When
the sample is drawn, it is unlikely that the value of $\hat\theta$ is the true value of the
parameter. In fact, if $\hat\theta$ has a continuous distribution, then $P_\theta(\hat\theta = \theta) = 0$, where the
notation $P_\theta$ denotes that the probability is computed when $\theta$ is the true parameter.
What is needed is an estimate of the error of the estimation; i.e., by how much did
$\hat\theta$ miss $\theta$? In this section, we embody this estimate of error in terms of a confidence
interval, which we now formally define:


Definition 4.2.1 (Confidence Interval). Let $X_1, X_2, \ldots, X_n$ be a sample on a random
variable X, where X has pdf $f(x;\theta)$, $\theta \in \Omega$. Let $0 < \alpha < 1$ be specified. Let
$L = L(X_1, X_2, \ldots, X_n)$ and $U = U(X_1, X_2, \ldots, X_n)$ be two statistics. We say that
the interval (L,U) is a $(1-\alpha)100\%$ confidence interval for $\theta$ if

$$1 - \alpha = P_\theta[\theta \in (L,U)]. \tag{4.2.1}$$

That is, the probability that the interval includes $\theta$ is $1-\alpha$, which is called the
confidence coefficient or the confidence level of the interval.


Once the sample is drawn, the realized value of the confidence interval is (l, u),
an interval of real numbers. Either the interval (l, u) traps $\theta$ or it does not. One way
of thinking of a confidence interval is in terms of a Bernoulli trial with probability
of success $1-\alpha$. If one makes, say, M independent $(1-\alpha)100\%$ confidence intervals
over a period of time, then one would expect to have $(1-\alpha)M$ successful confidence
intervals (those that trap $\theta$) over this period of time. Hence one feels $(1-\alpha)100\%$
confident that the true value of $\theta$ lies in the interval (l, u).
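
The Bernoulli trial interpretation can be illustrated by simulation. The following sketch rests on assumptions that are not in the text: it draws M = 1000 hypothetical normal samples (n = 20, mean 10, standard deviation 2), forms a 95% interval for the mean from each with t.test (the t-interval derived in Example 4.2.1), and records the proportion of intervals that trap the true mean.

```r
set.seed(4210)                     # arbitrary seed for reproducibility
M <- 1000                          # number of independent confidence intervals
n <- 20; mu <- 10; sigma <- 2      # hypothetical sampling situation
alpha <- 0.05

# Each replicate is a Bernoulli trial: TRUE if the realized interval traps mu
trapped <- replicate(M, {
  x <- rnorm(n, mean = mu, sd = sigma)
  ci <- t.test(x, conf.level = 1 - alpha)$conf.int
  ci[1] < mu && mu < ci[2]
})

coverage <- mean(trapped)          # should be close to 1 - alpha
coverage
```

The observed proportion fluctuates around $1-\alpha$ from run to run, in line with the expected count $(1-\alpha)M$ of successful intervals.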

A measure of efficiency based on a confidence interval is its expected length.
Suppose $(L_1, U_1)$ and $(L_2, U_2)$ are two confidence intervals for $\theta$ that have the same
confidence coefficient. Then we say that $(L_1, U_1)$ is more efficient than $(L_2, U_2)$ if
$E_\theta(U_1 - L_1) \le E_\theta(U_2 - L_2)$ for all $\theta \in \Omega$.

There are several procedures for obtaining confidence intervals. We explore one 
of them in this section. It is based on a pivot random variable. The pivot is usually 
a function of an estimator of $\theta$ and the parameter and, further, the distribution of
the pivot is known. Using this information, an algebraic derivation can often be 
used to obtain a confidence interval. The next several examples illustrate the pivot 
method. A second way to obtain a confidence interval involves distribution free 
techniques, as used in Section 4.4.2 to determine confidence intervals for quantiles. 


Example 4.2.1 (Confidence Interval for $\mu$ Under Normality). Suppose the random
variables $X_1, \ldots, X_n$ are a random sample from a $N(\mu,\sigma^2)$ distribution. Let $\bar{X}$ and
$S^2$ denote the sample mean and sample variance, respectively. Recall from the last
section that $\bar{X}$ is the mle of $\mu$ and $[(n-1)/n]S^2$ is the mle of $\sigma^2$. By part (d) of
Theorem 3.6.1, the random variable $T = (\bar{X}-\mu)/(S/\sqrt{n})$ has a t-distribution with
n − 1 degrees of freedom. The random variable T is our pivot variable.

For $0 < \alpha < 1$, define $t_{\alpha/2,n-1}$ to be the upper $\alpha/2$ critical point of a
t-distribution with n − 1 degrees of freedom; i.e., $\alpha/2 = P(T > t_{\alpha/2,n-1})$. Using
a simple algebraic derivation, we obtain

$$1-\alpha = P(-t_{\alpha/2,n-1} < T < t_{\alpha/2,n-1})$$

$$= P_\mu\left(-t_{\alpha/2,n-1} < \frac{\bar{X}-\mu}{S/\sqrt{n}} < t_{\alpha/2,n-1}\right)$$

$$= P_\mu\left(-t_{\alpha/2,n-1}\frac{S}{\sqrt{n}} < \bar{X}-\mu < t_{\alpha/2,n-1}\frac{S}{\sqrt{n}}\right)$$

$$= P_\mu\left(\bar{X} - t_{\alpha/2,n-1}\frac{S}{\sqrt{n}} < \mu < \bar{X} + t_{\alpha/2,n-1}\frac{S}{\sqrt{n}}\right). \tag{4.2.2}$$

Once the sample is drawn, let $\bar{x}$ and s denote the realized values of the statistics $\bar{X}$
and S, respectively. Then a $(1-\alpha)100\%$ confidence interval for $\mu$ is given by

$$\left(\bar{x} - t_{\alpha/2,n-1}\,s/\sqrt{n},\ \bar{x} + t_{\alpha/2,n-1}\,s/\sqrt{n}\right). \tag{4.2.3}$$

This interval is often referred to as the $(1-\alpha)100\%$ t-interval for $\mu$. The estimate
of the standard deviation of $\bar{X}$, $s/\sqrt{n}$, is referred to as the standard error of $\bar{X}$.

In Example 4.1.3, we presented a data set on sulfur dioxide concentrations in a
damaged Bavarian forest. Let $\mu$ denote the true mean sulfur dioxide concentration.
Recall, based on the data, that our estimate of $\mu$ is $\bar{x} = 53.92$ with sample standard
deviation $s = \sqrt{101.48} = 10.07$. Since the sample size is n = 24, for a 99% confidence
interval the t-critical value is $t_{0.005,23}$ = qt(.995,23) = 2.807. Based on these
values, the confidence interval in expression (4.2.3) can be calculated. Assuming
that the R vector sulfurdioxide contains the sample, the R code to compute this
interval is t.test(sulfurdioxide,conf.level=0.99), which results in the 99%
confidence interval (48.14, 59.69). Many scientists write this interval as 53.92 ± 5.77.
In this way, we can see our estimate of $\mu$ and the margin of error.
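
To connect the formula (4.2.3) with the output of t.test, the following self-contained sketch computes the 99% interval for the sulfur dioxide data step by step. The data are those listed in Example 4.1.3; the variable names (tcrit, ci, ci_ttest) are illustrative only.

```r
# The 24 sulfur dioxide measurements listed in Example 4.1.3
sulfurdioxide <- c(33.4, 38.6, 41.7, 43.9, 44.4, 45.3, 46.1, 47.6, 50.0, 52.4,
                   52.7, 53.9, 54.3, 55.1, 56.4, 56.5, 60.7, 61.8, 62.2, 63.4,
                   65.5, 66.6, 70.0, 71.5)
n <- length(sulfurdioxide)
xbar <- mean(sulfurdioxide)
s <- sd(sulfurdioxide)

alpha <- 0.01
tcrit <- qt(1 - alpha / 2, df = n - 1)     # upper alpha/2 critical point of t(n - 1)
margin <- tcrit * s / sqrt(n)              # critical value times the standard error
ci <- xbar + c(-1, 1) * margin             # expression (4.2.3)

# The same interval from the built-in function
ci_ttest <- as.numeric(t.test(sulfurdioxide, conf.level = 0.99)$conf.int)

rbind(by_hand = ci, t_test = ci_ttest)
```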


The distribution of the pivot random variable $T = (\bar{X}-\mu)/(S/\sqrt{n})$ of the last
example depends on the normality of the sampled items; however, the result is
approximately true even if the sampled items are not drawn from a normal distribution.
The Central Limit Theorem (CLT) shows that the distribution of T is approximately
N(0,1). In order to use this result now, we state the CLT here, leaving its
proof to Chapter 5; see Theorem 5.3.1.


Theorem 4.2.1 (Central Limit Theorem). Let $X_1, X_2, \ldots, X_n$ denote the observations
of a random sample from a distribution that has mean $\mu$ and finite variance
$\sigma^2$. Then the distribution function of the random variable $W_n = (\bar{X}-\mu)/(\sigma/\sqrt{n})$
converges to $\Phi$, the distribution function of the N(0,1) distribution, as $n \to \infty$.


As we further show in Chapter 5, the result stays the same if we replace $\sigma$ by
the sample standard deviation S; that is, under the assumptions of Theorem 4.2.1,
the distribution of

$$Z_n = \frac{\bar{X}-\mu}{S/\sqrt{n}} \tag{4.2.4}$$

is approximately N(0,1). For the nonnormal case, as the next example shows, with
this result we can obtain an approximate confidence interval for $\mu$.


Example 4.2.2 (Large Sample Confidence Interval for the Mean $\mu$). Suppose
$X_1, X_2, \ldots, X_n$ is a random sample on a random variable X with mean $\mu$ and
variance $\sigma^2$, but, unlike the last example, the distribution of X is not normal.
However, from the above discussion we know that the distribution of $Z_n$, (4.2.4), is
approximately N(0,1). Hence, with $z_{\alpha/2}$ denoting the upper $\alpha/2$ critical point of the
N(0,1) distribution,

$$1-\alpha \approx P_\mu\left(-z_{\alpha/2} < \frac{\bar{X}-\mu}{S/\sqrt{n}} < z_{\alpha/2}\right).$$

Using the same algebraic derivation as in the last example, we obtain

$$1-\alpha \approx P_\mu\left(\bar{X} - z_{\alpha/2}\frac{S}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2}\frac{S}{\sqrt{n}}\right). \tag{4.2.5}$$

Again, letting $\bar{x}$ and s denote the realized values of the statistics $\bar{X}$ and S,
respectively, after the sample is drawn, an approximate $(1-\alpha)100\%$ confidence interval
for $\mu$ is given by

$$\left(\bar{x} - z_{\alpha/2}\,s/\sqrt{n},\ \bar{x} + z_{\alpha/2}\,s/\sqrt{n}\right). \tag{4.2.6}$$

This is called a large sample confidence interval for $\mu$. ■

In practice, we often do not know if the population is normal. Which confidence
interval should we use? Generally, for the same $\alpha$, the intervals based on $t_{\alpha/2,n-1}$
are wider than those based on $z_{\alpha/2}$. Hence the interval (4.2.3) is generally more
conservative than the interval (4.2.6). So in practice, statisticians generally prefer
the interval (4.2.3).

Occasionally in practice, the standard deviation $\sigma$ is assumed known. In this
case, the confidence interval generally used for $\mu$ is (4.2.6) with s replaced by $\sigma$.
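
The large sample interval (4.2.6) takes only a few lines of R. The sketch below makes assumptions that are not part of the text: large_sample_ci is a hypothetical helper, and the sample is simulated from an exponential distribution with mean 5 to stand in for nonnormal data. It also compares the width of (4.2.6) with that of the t-interval (4.2.3) on the same sample.

```r
# Approximate (1 - alpha)100% interval (4.2.6) based on the N(0,1) critical point
large_sample_ci <- function(x, conf.level = 0.95) {
  alpha <- 1 - conf.level
  z <- qnorm(1 - alpha / 2)                # upper alpha/2 critical point of N(0,1)
  se <- sd(x) / sqrt(length(x))            # standard error of the sample mean
  mean(x) + c(-1, 1) * z * se
}

set.seed(426)                              # arbitrary seed for reproducibility
x <- rexp(50, rate = 1 / 5)                # hypothetical nonnormal sample, mean 5

z_ci <- large_sample_ci(x)                 # interval (4.2.6)
t_ci <- as.numeric(t.test(x)$conf.int)     # interval (4.2.3)

rbind(z_interval = z_ci, t_interval = t_ci)
c(z_width = diff(z_ci), t_width = diff(t_ci))
```

Both intervals are centered at the sample mean; the t-interval is the wider of the two because the t critical point exceeds the normal one for every finite n.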


Example 4.2.3 (Large Sample Confidence Interval for p). Let X be a Bernoulli
random variable with probability of success p, where X is 1 or 0 if the outcome is
success or failure, respectively. Suppose $X_1, \ldots, X_n$ is a random sample from the
distribution of X. Let $\hat{p} = \bar{X}$ be the sample proportion of successes. Note that
$\hat{p} = n^{-1}\sum_{i=1}^{n} X_i$ is a sample average and that $\text{Var}(\hat{p}) = p(1-p)/n$. It follows
immediately from the CLT that the distribution of $Z = (\hat{p}-p)/\sqrt{p(1-p)/n}$ is
approximately N(0,1). Referring to Example 5.1.1 of Chapter 5, we replace $p(1-p)$
with its estimate $\hat{p}(1-\hat{p})$. Then proceeding as in the last example, an approximate
$(1-\alpha)100\%$ confidence interval for p is given by

$$\left(\hat{p} - z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n},\ \hat{p} + z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}\right), \tag{4.2.7}$$

where $\sqrt{\hat{p}(1-\hat{p})/n}$ is called the standard error of $\hat{p}$.

In Example 4.1.2 we discussed a data set involving hip replacements, some
of which were squeaky. The outcomes of a hip replacement were squeaky and
non-squeaky which we labeled as success or failure, respectively. In the sample
there were 28 successes out of 143 replacements. Using R, the 99% confidence
interval for p, the probability of a squeaky hip replacement, is computed by
prop.test(28,143,conf.level=.99), which results in the interval (0.122, 0.298).
So with 99% confidence, we estimate the probability of a squeaky hip replacement
to be between 0.122 and 0.298.
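
Expression (4.2.7) can also be evaluated directly. In the sketch below, wald_ci is a hypothetical helper name, not a function from the text or from R. Note that, as far as the R documentation describes, prop.test by default returns a score-based interval with a continuity correction, so its endpoints need not coincide with those of (4.2.7) computed here.

```r
# Approximate (1 - alpha)100% interval (4.2.7) for a proportion
wald_ci <- function(successes, n, conf.level = 0.95) {
  phat <- successes / n
  z <- qnorm(1 - (1 - conf.level) / 2)     # upper alpha/2 critical point of N(0,1)
  se <- sqrt(phat * (1 - phat) / n)        # standard error of phat
  phat + c(-1, 1) * z * se
}

# Hip replacement data: 28 squeaky replacements out of 143
ci <- wald_ci(28, 143, conf.level = 0.99)
ci
```

For these data the endpoints from (4.2.7) are somewhat smaller than the values reported by prop.test, which reflects the different construction of the two intervals rather than a disagreement about the estimate itself.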


4.2.1 Confidence Intervals for Difference in Means 


A practical problem of interest is the comparison of two distributions, that is,
comparing the distributions of two random variables, say X and Y. In this section, we
compare the means of X and Y. Denote the means of X and Y by $\mu_1$ and $\mu_2$,
respectively. In particular, we obtain confidence intervals for the difference $\Delta = \mu_1 - \mu_2$.
Assume that the variances of X and Y are finite and denote them as $\sigma_1^2 = \text{Var}(X)$
and $\sigma_2^2 = \text{Var}(Y)$. Let $X_1, \ldots, X_{n_1}$ be a random sample from the distribution of X
and let $Y_1, \ldots, Y_{n_2}$ be a random sample from the distribution of Y. Assume that
the samples were gathered independently of one another. Let $\bar{X} = n_1^{-1}\sum_{i=1}^{n_1} X_i$
and $\bar{Y} = n_2^{-1}\sum_{i=1}^{n_2} Y_i$ be the sample means. Let $\hat\Delta = \bar{X} - \bar{Y}$. The statistic $\hat\Delta$ is
an unbiased estimator of $\Delta$. This difference, $\hat\Delta - \Delta$, is the numerator of the pivot
random variable. By independence of the samples,

$$\text{Var}(\hat\Delta) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}.$$

Let $S_1^2 = (n_1-1)^{-1}\sum_{i=1}^{n_1}(X_i-\bar{X})^2$ and $S_2^2 = (n_2-1)^{-1}\sum_{i=1}^{n_2}(Y_i-\bar{Y})^2$ be the
sample variances. Then estimating the variances by the sample variances, consider
the random variable

$$Z = \frac{\hat\Delta - \Delta}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}. \tag{4.2.8}$$

By the independence of the samples and Theorem 4.2.1, this pivot variable has
an approximate N(0,1) distribution. This leads to the approximate $(1-\alpha)100\%$
confidence interval for $\Delta = \mu_1 - \mu_2$ given by

$$\left((\bar{x}-\bar{y}) - z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}},\ (\bar{x}-\bar{y}) + z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\right), \tag{4.2.9}$$

where $\sqrt{(s_1^2/n_1) + (s_2^2/n_2)}$ is the standard error of $\bar{X} - \bar{Y}$. This is a large sample
$(1-\alpha)100\%$ confidence interval for $\mu_1 - \mu_2$.
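
The interval (4.2.9) can be sketched as a short R function. The helper name two_sample_z_ci and the two simulated exponential samples are assumptions made for illustration; they do not come from the text.

```r
# Large sample (1 - alpha)100% interval (4.2.9) for the difference in means
two_sample_z_ci <- function(x, y, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)                   # upper alpha/2 point of N(0,1)
  se <- sqrt(var(x) / length(x) + var(y) / length(y))    # standard error of xbar - ybar
  (mean(x) - mean(y)) + c(-1, 1) * z * se
}

set.seed(429)                      # arbitrary seed for reproducibility
x <- rexp(60, rate = 1 / 3)        # hypothetical sample with mean 3
y <- rexp(80, rate = 1 / 2)        # hypothetical sample with mean 2

ci <- two_sample_z_ci(x, y)
ci
```

No equal variance or normality assumption is used here; the approximation relies only on the sample sizes being large enough for the CLT to apply to both sample means.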

The above confidence interval is approximate. In this situation we can obtain 
exact confidence intervals if we assume that the distributions of X and Y are normal 
with the same variance; i.e., 7? = 03. Thus the distributions can differ only in 
location, ie., a location model. Assume then that X is distributed N (11,07) 
and Y is distributed N(j12,07), where o? is the common variance of X and Y. 
As above, let X1,...,Xn, be a random sample from the distribution of X, let 
Yi,..-, Yn. be a random sample from the distribution of Y, and assume that the 
samples are independent of one another. Let n = n1 + n2 be the total sample size. 
Our estimator of A is X — Y. Our goal is to show that a pivot random variable, 
defined below, has a ¢-distribution, which is defined in Section 3.6. 

Because X is distributed N(1,07/n1), Y is distributed N(2,07/n2), and X 
and Y are independent, we have the result 


(YO) Fan N(0, 1) distribution. (4.2.10) 


al: ul 
OV np tag 


This serves as the numerator of our T-statistic. 
Let 


ni — 1)S? + (n2 — 1)S3 


= ( (4.2.11) 


ny tng —-2 


Note that S? is a weighted average of Sj and $3. It is easy to see that S? is 
an unbiased estimator of 07. It is called the pooled estimator of o?. Also, 
because (ni — 1).$?/o0? has a y?(n1 —1) distribution, (nz —1)$3/o0? has a y?(n2—1) 
distribution, and Sj and $3 are independent, we have that (n — 2)S>/o has a 
x?(n — 2) distribution; see Corollary 3.3.1. Finally, because $? is independent of 
X and S$} is independent of Y, and the random samples are independent of each 
other, it follows that S3 is independent of expression (4.2.10). Therefore, from the 
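result of Section 3.6.1 concerning Student's $t$-distribution, we have that

$$T = \frac{[(\overline{X} - \overline{Y}) - (\mu_1 - \mu_2)]\big/\sigma\sqrt{n_1^{-1} + n_2^{-1}}}{\sqrt{(n-2)S_p^2/[(n-2)\sigma^2]}} = \frac{(\overline{X} - \overline{Y}) - (\mu_1 - \mu_2)}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \tag{4.2.12}$$

has a $t$-distribution with $n - 2$ degrees of freedom. From this last result, it is easy
to see that the following interval is an exact $(1 - \alpha)100\%$ confidence interval for
$\Delta = \mu_1 - \mu_2$:

$$\left((\overline{x} - \overline{y}) - t_{\alpha/2,n-2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}},\ \ (\overline{x} - \overline{y}) + t_{\alpha/2,n-2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\right). \tag{4.2.13}$$

A consideration of the difficulty encountered when the unknown variances of the
two normal distributions are not equal is assigned to one of the exercises.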
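
The pooled interval (4.2.13) can be computed from first principles in a few lines of R. The sketch below is only an illustration: the two samples are simulated, and the helper name pooled_ci is hypothetical, not a function from the text. It compares the hand computation with R's built-in t.test using var.equal = TRUE.

```r
# Pooled t confidence interval (4.2.13) computed from first principles.
# The data are simulated for illustration only; they are not from the text.
set.seed(42)
x <- rnorm(12, mean = 10, sd = 2)   # hypothetical sample from N(mu1, sigma^2)
y <- rnorm(15, mean = 8,  sd = 2)   # hypothetical sample from N(mu2, sigma^2)

pooled_ci <- function(x, y, conf = 0.95) {
  n1 <- length(x); n2 <- length(y); n <- n1 + n2
  sp2 <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n - 2)   # pooled variance (4.2.11)
  tcrit <- qt(1 - (1 - conf) / 2, df = n - 2)                # t_{alpha/2, n-2}
  half <- tcrit * sqrt(sp2) * sqrt(1 / n1 + 1 / n2)          # half-width of (4.2.13)
  c(lower = mean(x) - mean(y) - half, upper = mean(x) - mean(y) + half)
}

ci_hand  <- pooled_ci(x, y)
ci_ttest <- as.numeric(t.test(x, y, var.equal = TRUE)$conf.int)
rbind(ci_hand, ci_ttest)
```

The two rows should agree, since t.test with var.equal = TRUE uses the same pooled estimator and $n - 2$ degrees of freedom.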


Example 4.2.4. To illustrate the pooled $t$-confidence interval, consider the baseball
data presented in Hettmansperger and McKean (2011). It consists of 6 variables
recorded on 59 professional baseball players, of which 33 are hitters and 26 are pitchers.
The data can be found in the file bb.rda located at the site listed in Chapter
1. The height in inches of a player is one of these measurements and in this example
we consider the difference in heights between pitchers and hitters. Denote the
true mean heights of the pitchers and hitters by $\mu_p$ and $\mu_h$, respectively, and let
$\Delta = \mu_p - \mu_h$. The sample averages of the heights are 75.19 and 72.67 inches for
the pitchers and hitters, respectively. Hence, our point estimate of $\Delta$ is 2.53 inches.
Assuming the file bb.rda has been loaded in R, the following R segment computes
the 95% confidence interval for $\Delta$:

hitht=height[hitpitind==1]; pitht=height[hitpitind==0]

t.test(pitht,hitht,var.equal=T)

The confidence interval computes to (1.42, 3.63). Note that all values in the confidence
interval are positive, indicating that on the average pitchers are taller than
hitters. ■


Remark 4.2.1. Suppose $X$ and $Y$ are not normally distributed but that their
distributions differ only in location. As we show in Chapter 5, the above interval,
(4.2.13), is then approximate and not exact. ■


4.2.2 Confidence Interval for Difference in Proportions 


Let $X$ and $Y$ be two independent random variables with Bernoulli distributions
$b(1, p_1)$ and $b(1, p_2)$, respectively. Let us now turn to the problem of finding a confidence
interval for the difference $p_1 - p_2$. Let $X_1, \ldots, X_{n_1}$ be a random sample from
the distribution of $X$ and let $Y_1, \ldots, Y_{n_2}$ be a random sample from the distribution
of $Y$. As above, assume that the samples are independent of one another and let
$n = n_1 + n_2$ be the total sample size. Our estimator of $p_1 - p_2$ is the difference in
sample proportions, which, of course, is given by $\overline{X} - \overline{Y}$. We use the traditional
notation and write $\hat{p}_1$ and $\hat{p}_2$ instead of $\overline{X}$ and $\overline{Y}$, respectively. Hence, from the
above discussion, an interval such as (4.2.9) serves as an approximate confidence
interval for $p_1 - p_2$. Here, $\sigma_1^2 = p_1(1 - p_1)$ and $\sigma_2^2 = p_2(1 - p_2)$. In the interval,
we estimate these by $\hat{p}_1(1 - \hat{p}_1)$ and $\hat{p}_2(1 - \hat{p}_2)$, respectively. Thus our approximate
$(1 - \alpha)100\%$ confidence interval for $p_1 - p_2$ is

$$\hat{p}_1 - \hat{p}_2 \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}. \tag{4.2.14}$$


Example 4.2.5. Kloke and McKean (2014), page 33, discuss a data set from the
original clinical study of the Salk polio vaccine in 1954. At random, one group of
children (Treated) received the vaccine while the other group (Control) received a
placebo. Let $p_C$ and $p_T$ denote the true proportions of polio cases for control and
treated populations, respectively. The tabled results are:


Group | No. Children | Sample Proportion
Treated | 200,745 | 0.000284
Control | 201,229 | 0.000706


Note that $\hat{p}_C > \hat{p}_T$. The following R segment computes the 95% confidence interval
for $p_C - p_T$:

prop.test(c(199,57),c(201229,200745))

The confidence interval is (0.00054, 0.00087). All values in this interval are positive,
indicating that the vaccine is effective in reducing the incidence of polio. ■
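
Formula (4.2.14) is easy to evaluate directly. The following sketch uses hypothetical counts (they are not data from the text) and a hypothetical helper name, diff_prop_ci. Note that prop.test applies a continuity correction by default, so the argument correct = FALSE is needed for its interval to coincide with (4.2.14).

```r
# Large-sample interval (4.2.14) for p1 - p2, computed directly.
# The counts below are hypothetical and serve only to exercise the formula.
y1 <- 45; n1 <- 120    # successes and trials, sample 1 (hypothetical)
y2 <- 30; n2 <- 110    # successes and trials, sample 2 (hypothetical)

diff_prop_ci <- function(y1, n1, y2, n2, conf = 0.95) {
  p1 <- y1 / n1; p2 <- y2 / n2                             # sample proportions
  z  <- qnorm(1 - (1 - conf) / 2)                          # z_{alpha/2}
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)      # estimated standard error
  c(p1 - p2 - z * se, p1 - p2 + z * se)
}

ci_hand <- diff_prop_ci(y1, n1, y2, n2)
# Same interval from prop.test once the continuity correction is switched off.
ci_r <- as.numeric(prop.test(c(y1, y2), c(n1, n2), correct = FALSE)$conf.int)
rbind(ci_hand, ci_r)
```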


EXERCISES 


4.2.1. Let the observed value of the mean $\overline{X}$ and of the sample variance of a random
sample of size 20 from a distribution that is $N(\mu, \sigma^2)$ be 81.2 and 26.5, respectively.
Find respectively 90%, 95% and 99% confidence intervals for $\mu$. Note how the
lengths of the confidence intervals increase as the confidence increases.


4.2.2. Consider the data on the lifetimes of motors given in Exercise 4.1.1. Obtain 
a large sample 95% confidence interval for the mean lifetime of a motor. 


4.2.3. Suppose we assume that $X_1, X_2, \ldots, X_n$ is a random sample from a $\Gamma(1, \theta)$
distribution.


(a) Show that the random variable $(2/\theta)\sum_{i=1}^n X_i$ has a $\chi^2$-distribution with $2n$
degrees of freedom.


(b) Using the random variable in part (a) as a pivot random variable, find a
$(1 - \alpha)100\%$ confidence interval for $\theta$.


(c) Obtain the confidence interval in part (b) for the data of Exercise 4.1.1 and 
compare it with the interval you obtained in Exercise 4.2.2. 
4.2.4. In Example 4.2.4, for the baseball data, we found a confidence interval for
the mean difference in heights between the pitchers and hitters. In this exercise, 
find the pooled t 95% confidence interval for the mean difference in weights between 
the pitchers and hitters. 


4.2.5. In the baseball data set discussed in the last exercise, it was found that out 
of the 59 baseball players, 15 were left-handed. Is this odd, since the proportion of 
left-handed males in America is about 11%? Answer by using (4.2.7) to construct a 
95% approximate confidence interval for p, the proportion of left-handed professional 
baseball players. 


4.2.6. Let $\overline{X}$ be the mean of a random sample of size $n$ from a distribution that is
$N(\mu, 9)$. Find $n$ such that $P(\overline{X} - 1 < \mu < \overline{X} + 1) = 0.90$, approximately.


4.2.7. Let a random sample of size 17 from the normal distribution $N(\mu, \sigma^2)$ yield
$\overline{x} = 4.7$ and $s^2 = 5.76$. Determine a 90% confidence interval for $\mu$.


4.2.8. Let $\overline{X}$ denote the mean of a random sample of size $n$ from a distribution that
has mean $\mu$ and variance $\sigma^2 = 10$. Find $n$ so that the probability is approximately
0.954 that the random interval $(\overline{X} - \frac{1}{2}, \overline{X} + \frac{1}{2})$ includes $\mu$.


4.2.9. Let $X_1, X_2, \ldots, X_9$ be a random sample of size 9 from a distribution that is
$N(\mu, \sigma^2)$.


(a) If $\sigma$ is known, find the length of a 95% confidence interval for $\mu$ if this interval
is based on the random variable $\sqrt{9}(\overline{X} - \mu)/\sigma$.


(b) If $\sigma$ is unknown, find the expected value of the length of a 95% confidence
interval for $\mu$ if this interval is based on the random variable $\sqrt{9}(\overline{X} - \mu)/S$.
Hint: Write $E(S) = (\sigma/\sqrt{n-1})\,E[((n-1)S^2/\sigma^2)^{1/2}]$.


(c) Compare these two answers. 


4.2.10. Let $X_1, X_2, \ldots, X_n, X_{n+1}$ be a random sample of size $n + 1$, $n > 1$, from a
distribution that is $N(\mu, \sigma^2)$. Let $\overline{X} = \sum_{i=1}^n X_i/n$ and $S^2 = \sum_{i=1}^n (X_i - \overline{X})^2/(n-1)$.
Find the constant $c$ so that the statistic $c(\overline{X} - X_{n+1})/S$ has a $t$-distribution. If
$n = 8$, determine $k$ such that $P(\overline{X} - kS < X_9 < \overline{X} + kS) = 0.80$. The observed
interval $(\overline{x} - ks, \overline{x} + ks)$ is often called an 80% prediction interval for $X_9$.


4.2.11. Let $X_1, \ldots, X_n$ be a random sample from a $N(0,1)$ distribution. Then the
probability that the random interval $\overline{X} \pm t_{\alpha/2,n-1}(S/\sqrt{n})$ traps $\mu = 0$ is $(1 - \alpha)$. To
verify this empirically, in this exercise, we simulate $m$ such intervals and calculate
the proportion that trap 0, which should be "close" to $(1 - \alpha)$.


(a) Set n = 10 and m = 50. Run the R code mat=matrix(rnorm(m*n),ncol=n)
which generates $m$ samples of size $n$ from the $N(0,1)$ distribution. Each row
of the matrix mat contains a sample. For this matrix of samples, the function
below computes the $(1 - \alpha)100\%$ confidence intervals, returning them in an
$m \times 2$ matrix. Run this function on your generated matrix mat. What is the
proportion of successful confidence intervals?
getcis <- function(mat,cc=.90){
numb <- length(mat[,1]); ci <- c()
for(j in 1:numb)
{ci<-rbind(ci,t.test(mat[j,],conf.level=cc)$conf.int)}
return(ci)}

This function is also at the site discussed in Section 1.1. 


(b) Run the following code which plots the intervals. Label the successful inter- 
vals. Comment on the variability of the lengths of the confidence intervals. 
cis<-getcis(mat); x<-1:m
plot(c(cis[,1],cis[,2])~c(x,x),pch="",xlab="Sample",ylab="CI")
points(cis[,1]~x,pch="L"); points(cis[,2]~x,pch="U"); abline(h=0)


4.2.12. In Exercise 4.2.11, the sampling was from the $N(0,1)$ distribution. Show,
however, that setting $\mu = 0$ and $\sigma = 1$ is without loss of generality.

Hint: First, $X_1, \ldots, X_n$ is a random sample from the $N(\mu, \sigma^2)$ if and only if
$Z_1, \ldots, Z_n$ is a random sample from the $N(0,1)$, where $Z_i = (X_i - \mu)/\sigma$. Then
show the confidence interval based on the $Z_i$'s contains 0 if and only if the confidence
interval based on the $X_i$'s contains $\mu$.


4.2.13. Change the code in the R function getcis so that it also returns the vector,
ind, where ind[i] = 1 if the $i$th confidence interval is successful and 0 otherwise.
Show that the empirical confidence level is mean(ind).


(a) Run 10,000 simulations for the normal setup in Exercise 4.2.11 and compute
the empirical confidence level.


(b) Run 10,000 simulations when the sampling is from the Cauchy distribution,
(1.8.8), and compute the empirical confidence level. Does it differ from (a)?
Note that the R code rcauchy(k) returns a sample of size k from this Cauchy
distribution.


(c) Note that these empirical confidence levels are proportions from samples that
are independent. Hence, use the 95% confidence interval given in expression
(4.2.14) to statistically investigate whether or not the true confidence levels
differ. Comment.


4.2.14. Let $\overline{X}$ denote the mean of a random sample of size 25 from a gamma-type
distribution with $\alpha = 4$ and $\beta > 0$. Use the Central Limit Theorem to find an
approximate 0.954 confidence interval for $\mu$, the mean of the gamma distribution.
Hint: Use the random variable $(\overline{X} - 4\beta)/(4\beta^2/25)^{1/2} = 5\overline{X}/(2\beta) - 10$.


4.2.15. Let $\overline{x}$ be the observed mean of a random sample of size $n$ from a distribution
having mean $\mu$ and known variance $\sigma^2$. Find $n$ so that $\overline{x} - \sigma/4$ to $\overline{x} + \sigma/4$ is an
approximate 95% confidence interval for $\mu$.


4.2.16. Assume a binomial model for a certain random variable. If we desire a 90% 
confidence interval for p that is at most 0.02 in length, find n. 


Hint: Note that $\sqrt{(y/n)(1 - y/n)} \le \sqrt{(\frac{1}{2})(1 - \frac{1}{2})}$.
4.2.17. It is known that a random variable $X$ has a Poisson distribution with
parameter $\mu$. A sample of 200 observations from this distribution has a mean equal
to 3.4. Construct an approximate 90% confidence interval for $\mu$.


4.2.18. Let $X_1, X_2, \ldots, X_n$ be a random sample from $N(\mu, \sigma^2)$, where both parameters
$\mu$ and $\sigma^2$ are unknown. A confidence interval for $\sigma^2$ can be found as follows.
We know that $(n-1)S^2/\sigma^2$ is a random variable with a $\chi^2(n-1)$ distribution. Thus
we can find constants $a$ and $b$ so that $P((n-1)S^2/\sigma^2 < b) = 0.975$ and
$P(a < (n-1)S^2/\sigma^2 < b) = 0.95$. In R, b = qchisq(0.975,n-1), while a = qchisq(0.025,n-1).


(a) Show that this second probability statement can be written as


$$P((n-1)S^2/b < \sigma^2 < (n-1)S^2/a) = 0.95.$$


(b) If $n = 9$ and $s^2 = 7.93$, find a 95% confidence interval for $\sigma^2$.


(c) If $\mu$ is known, how would you modify the preceding procedure for finding a
confidence interval for $\sigma^2$?


4.2.19. Let $X_1, X_2, \ldots, X_n$ be a random sample from a gamma distribution with
known parameter $\alpha = 3$ and unknown $\beta > 0$. In Exercise 4.2.14, we obtained an
approximate confidence interval for $\beta$ based on the Central Limit Theorem. In this
exercise obtain an exact confidence interval by first obtaining the distribution of
$2\sum_{i=1}^n X_i/\beta$.


Hint: Follow the procedure outlined in Exercise 4.2.18. 


4.2.20. When 100 tacks were thrown on a table, 60 of them landed point up. Obtain 
a 95% confidence interval for the probability that a tack of this type lands point 
up. Assume independence. 


4.2.21. Let two independent random samples, each of size 10, from two normal
distributions $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$ yield $\overline{x} = 4.8$, $s_1^2 = 8.64$, $\overline{y} = 5.6$, $s_2^2 = 7.88$.
Find a 95% confidence interval for $\mu_1 - \mu_2$.


4.2.22. Let two independent random variables, $Y_1$ and $Y_2$, with binomial distributions
that have parameters $n_1 = n_2 = 100$, $p_1$, and $p_2$, respectively, be observed
to be equal to $y_1 = 50$ and $y_2 = 40$. Determine an approximate 90% confidence
interval for $p_1 - p_2$.


4.2.23. Discuss the problem of finding a confidence interval for the difference $\mu_1 - \mu_2$
between the two means of two normal distributions if the variances $\sigma_1^2$ and $\sigma_2^2$ are
known but not necessarily equal.


4.2.24. Discuss Exercise 4.2.23 when it is assumed that the variances are unknown
and unequal. This is a very difficult problem, and the discussion should point out
exactly where the difficulty lies. If, however, the variances are unknown but their
ratio $\sigma_1^2/\sigma_2^2$ is a known constant $k$, then a statistic that is a $T$ random variable can
again be used. Why?
4.2.25. To illustrate Exercise 4.2.24, let $X_1, X_2, \ldots, X_9$ and $Y_1, Y_2, \ldots, Y_{12}$ represent
two independent random samples from the respective normal distributions
$N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$. It is given that $\sigma_1^2 = 3\sigma_2^2$, but $\sigma_2^2$ is unknown. Define a
random variable that has a $t$-distribution that can be used to find a 95% confidence
interval for $\mu_1 - \mu_2$.


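4.2.26. Let $\overline{X}$ and $\overline{Y}$ be the means of two independent random samples, each of size
$n$, from the respective distributions $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$, where the common
variance is known. Find $n$ such that


$$P(\overline{X} - \overline{Y} - \sigma/5 < \mu_1 - \mu_2 < \overline{X} - \overline{Y} + \sigma/5) = 0.90.$$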


4.2.27. Let $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ be two independent random samples
from the respective normal distributions $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$, where the four
parameters are unknown. To construct a confidence interval for the ratio, $\sigma_1^2/\sigma_2^2$, of
the variances, form the quotient of the two independent $\chi^2$ variables, each divided
by its degrees of freedom, namely,


$$F = \frac{\dfrac{(m-1)S_2^2}{\sigma_2^2}\Big/(m-1)}{\dfrac{(n-1)S_1^2}{\sigma_1^2}\Big/(n-1)} = \frac{S_2^2/\sigma_2^2}{S_1^2/\sigma_1^2},$$


where $S_1^2$ and $S_2^2$ are the respective sample variances.

(a) What kind of distribution does $F$ have?


(b) Critical values $a$ and $b$ can be found so that $P(F < b) = 0.975$ and
$P(a < F < b) = 0.95$. In R, b = qf(0.975,m-1,n-1), while a = qf(0.025,m-1,n-1).


(c) Rewrite the second probability statement as


$$P\left[a\,\frac{S_1^2}{S_2^2} < \frac{\sigma_1^2}{\sigma_2^2} < b\,\frac{S_1^2}{S_2^2}\right] = 0.95.$$


The observed values, $s_1^2$ and $s_2^2$, can be inserted in these inequalities to provide
a 95% confidence interval for $\sigma_1^2/\sigma_2^2$.


We caution the reader on the use of this confidence interval. This interval does 
depend on the normality of the distributions. If the distributions of X and Y 
are not normal then the true confidence coefficient may be far from the nominal 
confidence coefficient; see, for example, page 142 of Hettmansperger and McKean 
(2011) for discussion. 


4.3 *Confidence Intervals for Parameters of Discrete Distributions


In this section, we outline a procedure that can be used to obtain exact confidence
intervals for the parameters of discrete random variables. Let $X_1, X_2, \ldots, X_n$ be a
random sample on a discrete random variable $X$ with pmf $p(x;\theta)$, $\theta \in \Omega$, where
$\Omega$ is an interval of real numbers. Let $T = T(X_1, X_2, \ldots, X_n)$ be an estimator of $\theta$
with cdf $F_T(t;\theta)$. Assume that $F_T(t;\theta)$ is a nonincreasing and continuous function
of $\theta$ for every $t$ in the support of $T$. For a given realization of the sample, let $t$ be
the realized value of the statistic $T$. Let $\alpha_1 > 0$ and $\alpha_2 > 0$ be given such that
$\alpha = \alpha_1 + \alpha_2 < 0.50$. Let $\underline{\theta}$ and $\overline{\theta}$ be the solutions of the equations


$$F_T(t-;\underline{\theta}) = 1 - \alpha_2 \quad\text{and}\quad F_T(t;\overline{\theta}) = \alpha_1, \tag{4.3.1}$$


where $T-$ is the statistic whose support lags by one value of $T$'s support. For
instance, if $t_i < t_{i+1}$ are consecutive support values of $T$, then $T = t_{i+1}$ if and only
if $T- = t_i$. Under these conditions, the interval $(\underline{\theta}, \overline{\theta})$ is a confidence interval for $\theta$
with confidence coefficient of at least $1 - \alpha$. We sketch a proof of this at the end of
this section.

Before proceeding with discrete examples, we provide an example in the con- 
tinuous case where the solution of equations (4.3.1) produces a familiar confidence 
interval. 


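Example 4.3.1. Assume $X_1, \ldots, X_n$ is a random sample from a $N(\theta, \sigma^2)$ distribution,
where $\sigma^2$ is known. Let $\overline{X}$ be the sample mean and let $\overline{x}$ be its value for a
given realization of the sample. Recall, from expression (4.2.6), that $\overline{x} \pm z_{\alpha/2}(\sigma/\sqrt{n})$
is a $(1 - \alpha)100\%$ confidence interval for $\theta$. Assuming $\theta$ is the true mean, the cdf
of $\overline{X}$ is $F_{\overline{X};\theta}(t) = \Phi[(t - \theta)/(\sigma/\sqrt{n})]$, where $\Phi(z)$ is the cdf of a standard normal
distribution. Note for the continuous case that $\overline{X}-$ has the same distribution as $\overline{X}$.
Then, taking $\alpha_1 = \alpha_2 = \alpha/2$, the first equation of (4.3.1) yields


$$\Phi[(\overline{x} - \underline{\theta})/(\sigma/\sqrt{n})] = 1 - (\alpha/2);$$


i.e.,

$$(\overline{x} - \underline{\theta})/(\sigma/\sqrt{n}) = \Phi^{-1}[1 - (\alpha/2)] = z_{\alpha/2}.$$

Solving for $\underline{\theta}$, we obtain the lower bound of the confidence interval, $\overline{x} - z_{\alpha/2}(\sigma/\sqrt{n})$.
Similarly, the solution of the second equation is the upper bound of the confidence
interval. ■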


For the discrete case, generally iterative algorithms are used to solve equations
(4.3.1). In practice, the function $F_T(T;\theta)$ is often strictly decreasing and continuous
in $\theta$, so a simple algorithm often suffices. We illustrate the examples below by using
the simple bisection algorithm, which we now briefly discuss.


Remark 4.3.1 (Bisection Algorithm). Suppose we want to solve the equation
$g(x) = d$, where $g(x)$ is strictly decreasing. Assume on a given step of the algorithm
that $a < b$ bracket the solution; i.e., $g(a) > d > g(b)$. Let $c = (a + b)/2$. Then on
the next step of the algorithm, the new bracket values $a$ and $b$ are determined by


if $(g(c) > d)$ then $\{a \leftarrow c$ and $b \leftarrow b\}$
if $(g(c) < d)$ then $\{a \leftarrow a$ and $b \leftarrow c\}$.


The algorithm continues until $|a - b| < \epsilon$, where $\epsilon > 0$ is a specified tolerance. ■
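
The bisection steps of Remark 4.3.1 translate almost line for line into R. The sketch below is a minimal illustration under stated assumptions: the function name bisect, its argument names, and the iteration cap maxit are our own choices and are not taken from the text's R functions.

```r
# Bisection for g(x) = d with g strictly decreasing, as in Remark 4.3.1.
# `bisect` and its argument names are illustrative, not functions from the text.
bisect <- function(g, d, a, b, eps = 1e-8, maxit = 200) {
  stopifnot(a < b, g(a) > d, d > g(b))   # a and b must bracket the solution
  for (it in seq_len(maxit)) {
    if (abs(a - b) < eps) break          # stop once the bracket is short enough
    cc <- (a + b) / 2                    # midpoint of the current bracket
    if (g(cc) > d) a <- cc else b <- cc  # keep the half that still brackets
  }
  (a + b) / 2
}

# Example: solve exp(-x) = 0.5 on (0, 2); the exact root is log(2).
root <- bisect(function(x) exp(-x), d = 0.5, a = 0, b = 2)
c(root, log(2))
```

Each pass halves the bracket, so roughly $\log_2((b - a)/\epsilon)$ iterations are needed to reach the tolerance.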
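Example 4.3.2 (Confidence Interval for a Bernoulli Proportion). Let $X$ have a
Bernoulli distribution with $\theta$ as the probability of success. Let $\Omega = (0,1)$. Suppose
$X_1, X_2, \ldots, X_n$ is a random sample on $X$. As our point estimator of $\theta$, we consider
$\overline{X}$, which is the sample proportion of successes. The distribution of $n\overline{X}$ is binomial$(n, \theta)$.
Thus


$$\begin{aligned}
F_{\overline{X}}(\overline{x};\theta) &= P(n\overline{X} \le n\overline{x}) \\
&= \sum_{j=0}^{n\overline{x}} \binom{n}{j}\theta^j(1-\theta)^{n-j} \\
&= 1 - \sum_{j=n\overline{x}+1}^{n} \binom{n}{j}\theta^j(1-\theta)^{n-j} \\
&= 1 - \int_0^{\theta} \frac{n!}{(n\overline{x})!\,[n-(n\overline{x}+1)]!}\, z^{n\overline{x}}(1-z)^{n-(n\overline{x}+1)}\,dz,
\end{aligned} \tag{4.3.2}$$


where the last equality, involving the incomplete $\beta$-function, follows from Exercise
4.3.6. By the fundamental theorem of calculus and expression (4.3.2),


$$\frac{d}{d\theta}F_{\overline{X}}(\overline{x};\theta) = -\frac{n!}{(n\overline{x})!\,[n-(n\overline{x}+1)]!}\,\theta^{n\overline{x}}(1-\theta)^{n-(n\overline{x}+1)} < 0;$$

hence, $F_{\overline{X}}(\overline{x};\theta)$ is a strictly decreasing function of $\theta$, for each $\overline{x}$. Next, let $\alpha_1, \alpha_2 > 0$
be specified constants such that $\alpha_1 + \alpha_2 < 1/2$ and let $\underline{\theta}$ and $\overline{\theta}$ solve the equations


$$F_{\overline{X}}(\overline{x}-;\underline{\theta}) = 1 - \alpha_2 \quad\text{and}\quad F_{\overline{X}}(\overline{x};\overline{\theta}) = \alpha_1. \tag{4.3.3}$$


Then $(\underline{\theta}, \overline{\theta})$ is a confidence interval for $\theta$ with confidence coefficient at least $1 - \alpha$,
where $\alpha = \alpha_1 + \alpha_2$. These equations can be solved iteratively, as discussed in the
following numerical illustration.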


Numerical Illustration. Suppose $n = 30$ and the realization of the sample mean
is $\overline{x} = 0.60$, i.e., the sample produced $n\overline{x} = 18$ successes. Take $\alpha_1 = \alpha_2 = 0.05$.
Because the support of the binomial consists of integers and $n\overline{x} = 18$, we can write
equations (4.3.3) as

$$\sum_{j=0}^{17}\binom{n}{j}\underline{\theta}^j(1-\underline{\theta})^{n-j} = 0.95 \quad\text{and}\quad \sum_{j=0}^{18}\binom{n}{j}\overline{\theta}^j(1-\overline{\theta})^{n-j} = 0.05. \tag{4.3.4}$$

Let bin$(n, p)$ denote a random variable with binomial distribution with parameters
$n$ and $p$. Because $P(\text{bin}(30, 0.4) \le 17)$ = pbinom(17,30,.4) = 0.9787 and because
$P(\text{bin}(30, 0.45) \le 17)$ = pbinom(17,30,.45) = 0.9286, the values 0.4 and 0.45 bracket
the solution to the first equation. We use these bracket values as input to the R
function¹ binomci.r which iteratively solves the equation. The call and its output
are:

> binomci(17,30,.4,.45,.95); $solution 0.4339417

¹ Download this function at the site given in the Preface.

So the solution to the first equation is $\underline{\theta} = 0.434$. In the same way, because
$P(\text{bin}(30, 0.7) \le 18) = 0.1593$ and $P(\text{bin}(30, 0.8) \le 18) = 0.0094$, the values 0.7
and 0.8 bracket the solution to the second equation. The R segment for the solution
is:

> binomci(18,30,.7,.8,.05); $solution 0.75047

Thus the confidence interval is (0.434, 0.750), with a confidence of at least 90%.
For comparison, the asymptotic 90% confidence interval of expression (4.2.7) is
(0.453, 0.747); see Exercise 4.3.2. ■
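
The two equations of this illustration can also be solved with base R alone. The sketch below is an assumption-laden stand-in: it uses the built-in root finder uniroot() in place of the text's binomci function, whose source is not reproduced here, and the variable names are ours. As a cross-check it uses the beta-quantile form of the same bounds, which follows from the incomplete beta representation in (4.3.2).

```r
# Exact binomial interval of the numerical illustration, solved with uniroot().
# uniroot() stands in for the text's binomci function, whose source is not shown here.
n <- 30; successes <- 18; a1 <- 0.05; a2 <- 0.05

# Lower bound: P(bin(n, theta) <= successes - 1) = 1 - a2, bracketed by 0.40 and 0.45.
theta_lo <- uniroot(function(th) pbinom(successes - 1, n, th) - (1 - a2),
                    lower = 0.40, upper = 0.45, tol = 1e-10)$root
# Upper bound: P(bin(n, theta) <= successes) = a1, bracketed by 0.70 and 0.80.
theta_hi <- uniroot(function(th) pbinom(successes, n, th) - a1,
                    lower = 0.70, upper = 0.80, tol = 1e-10)$root

# Cross-check via beta quantiles (incomplete beta form of the binomial cdf).
chk_lo <- qbeta(a2, successes, n - successes + 1)
chk_hi <- qbeta(1 - a1, successes + 1, n - successes)
rbind(c(theta_lo, theta_hi), c(chk_lo, chk_hi))
```

Both rows should reproduce, up to rounding, the interval (0.434, 0.750) reported above.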


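Example 4.3.3 (Confidence Interval for the Mean of a Poisson Distribution). Let
$X_1, X_2, \ldots, X_n$ be a random sample on a random variable $X$ that has a Poisson
distribution with mean $\theta$. Let $\overline{X} = n^{-1}\sum_{i=1}^n X_i$ be our point estimator of $\theta$. As
with the Bernoulli confidence interval in the last example, we can work with $n\overline{X}$,
which, in this case, has a Poisson distribution with mean $n\theta$. The cdf of $\overline{X}$ is


$$\begin{aligned}
F_{\overline{X}}(\overline{x};\theta) &= \sum_{j=0}^{n\overline{x}} e^{-n\theta}\frac{(n\theta)^j}{j!} \\
&= \frac{1}{\Gamma(n\overline{x}+1)}\int_{n\theta}^{\infty} x^{n\overline{x}}e^{-x}\,dx,
\end{aligned} \tag{4.3.5}$$


where the integral equation is obtained in Exercise 4.3.7. From expression (4.3.5),
we immediately have


$$\frac{d}{d\theta}F_{\overline{X}}(\overline{x};\theta) = \frac{-n}{\Gamma(n\overline{x}+1)}(n\theta)^{n\overline{x}}e^{-n\theta} < 0.$$

Therefore, $F_{\overline{X}}(\overline{x};\theta)$ is a strictly decreasing function of $\theta$ for every fixed $\overline{x}$. For a
given sample, let $\overline{x}$ be the realization of the statistic $\overline{X}$. Hence, as discussed above,
for $\alpha_1, \alpha_2 > 0$ such that $\alpha_1 + \alpha_2 < 1/2$, the confidence interval is given by $(\underline{\theta}, \overline{\theta})$,
where


$$\sum_{j=0}^{n\overline{x}-1} e^{-n\underline{\theta}}\frac{(n\underline{\theta})^j}{j!} = 1 - \alpha_2 \quad\text{and}\quad \sum_{j=0}^{n\overline{x}} e^{-n\overline{\theta}}\frac{(n\overline{\theta})^j}{j!} = \alpha_1. \tag{4.3.6}$$


The confidence coefficient of the interval $(\underline{\theta}, \overline{\theta})$ is at least $1 - \alpha = 1 - (\alpha_1 + \alpha_2)$. As
with the Bernoulli proportion, these equations can be solved iteratively.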


Numerical Illustration. Suppose $n = 25$ and the realized value of $\overline{X}$ is $\overline{x} = 5$;
hence, $n\overline{x} = 125$ events have occurred. We select $\alpha_1 = \alpha_2 = 0.05$. Then, by (4.3.6),
our confidence interval solves the equations


$$\sum_{j=0}^{124} e^{-n\underline{\theta}}\frac{(n\underline{\theta})^j}{j!} = 0.95 \quad\text{and}\quad \sum_{j=0}^{125} e^{-n\overline{\theta}}\frac{(n\overline{\theta})^j}{j!} = 0.05. \tag{4.3.7}$$


Our R function² poissonci.r uses the bisection algorithm to solve these equations.
Since ppois(124, 25 * 4) = 0.9932 and ppois(124, 25 * 4.4) = 0.9145, for the first
equation, 4.0 and 4.4 bracket the solution. Here is the call to poissonci.r along
with the solution (the lower bound of the confidence interval):

² Download this function at the site given in the Preface.

> poissonci(124,25,4,4.4,.95); $solution 4.287836

Since ppois(125, 25 * 5.5) = 0.1528 and ppois(125, 25 * 6.0) = 0.0204, for the second
equation, 5.5 and 6.0 bracket the solution. Hence, the computation of the upper
bound of the confidence interval is:

> poissonci(125,25,5.5,6,.05); $solution 5.800575

So the confidence interval is (4.287, 5.8), with confidence at least 90%. Note that
the confidence interval is right-skewed, similar to the Poisson distribution. ■
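
As with the binomial case, the Poisson equations can be solved with base R. The sketch below again substitutes uniroot() for the text's poissonci function (an assumption on our part, since that function's source is not shown here). The cross-check relies on the standard gamma/chi-square form of the Poisson cdf, which is closely related to the identity of Exercise 4.3.7.

```r
# Exact Poisson interval of the numerical illustration (n = 25, 125 events).
# uniroot() stands in for the text's poissonci function, whose source is not shown here.
n <- 25; events <- 125; a1 <- 0.05; a2 <- 0.05

# Lower bound: P(Poisson(n * theta) <= events - 1) = 1 - a2, bracketed by 4.0 and 4.4.
theta_lo <- uniroot(function(th) ppois(events - 1, n * th) - (1 - a2),
                    lower = 4.0, upper = 4.4, tol = 1e-10)$root
# Upper bound: P(Poisson(n * theta) <= events) = a1, bracketed by 5.5 and 6.0.
theta_hi <- uniroot(function(th) ppois(events, n * th) - a1,
                    lower = 5.5, upper = 6.0, tol = 1e-10)$root

# Cross-check through chi-square quantiles (gamma form of the Poisson cdf).
chk_lo <- qchisq(a2, 2 * events) / (2 * n)
chk_hi <- qchisq(1 - a1, 2 * (events + 1)) / (2 * n)
rbind(c(theta_lo, theta_hi), c(chk_lo, chk_hi))
```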


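A brief sketch of the theory behind this confidence interval follows. Consider
the general setup in the first paragraph of this section, where $T$ is an estimator of
the unknown parameter $\theta$ and $F_T(t;\theta)$ is the cdf of $T$. Define


$$\overline{\theta} = \sup\{\theta : F_T(T;\theta) \ge \alpha_1\} \tag{4.3.8}$$
$$\underline{\theta} = \inf\{\theta : F_T(T-;\theta) \le 1 - \alpha_2\}. \tag{4.3.9}$$


Hence, we have


$$\theta > \overline{\theta} \ \Rightarrow\ F_T(T;\theta) \le \alpha_1$$
$$\theta < \underline{\theta} \ \Rightarrow\ F_T(T-;\theta) \ge 1 - \alpha_2.$$


These implications lead to


$$\begin{aligned}
P[\underline{\theta} < \theta < \overline{\theta}] &= 1 - P[\{\theta < \underline{\theta}\} \cup \{\theta > \overline{\theta}\}] \\
&= 1 - P[\theta < \underline{\theta}] - P[\theta > \overline{\theta}] \\
&\ge 1 - P[F_T(T-;\theta) \ge 1 - \alpha_2] - P[F_T(T;\theta) \le \alpha_1] \\
&\ge 1 - \alpha_1 - \alpha_2,
\end{aligned}$$


where the last inequality is evident from equations (4.3.8) and (4.3.9). A rigorous
proof can be based on Exercise 4.8.13; see page 425 of Shao (1998) for details.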
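
For the Bernoulli case the coverage bound can be checked exactly, without simulation. Because the binomial cdf is strictly decreasing in $\theta$, a given $\theta$ lies inside the interval for an observed count $x$ exactly when $F(x - 1;\theta) < 1 - \alpha_2$ and $F(x;\theta) > \alpha_1$, so the coverage probability is a finite sum over the binomial support. The sketch below is an illustration under those assumptions; the function name coverage and the grid of $\theta$ values are our own choices.

```r
# Exact coverage of the interval (theta_lower, theta_upper) for a Bernoulli proportion.
# theta is missed only when F(x - 1; theta) >= 1 - a2 or F(x; theta) <= a1,
# so the coverage is a sum of binomial probabilities over the remaining counts.
coverage <- function(theta, n, a1 = 0.05, a2 = 0.05) {
  x <- 0:n
  covered <- (pbinom(x - 1, n, theta) < 1 - a2) & (pbinom(x, n, theta) > a1)
  sum(dbinom(x, n, theta)[covered])
}

thetas <- seq(0.05, 0.95, by = 0.05)     # illustrative grid of true proportions
cov30 <- sapply(thetas, coverage, n = 30)
round(range(cov30), 4)                   # coverage should not fall below 1 - a1 - a2 = 0.90
```

The computed coverages typically exceed the nominal level, which reflects the conservativeness of the exact interval for discrete data.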


EXERCISES 


4.3.1. Recall that for the baseball data (bb.rda), 15 out of 59 ballplayers are left-handed.
Let $p$ be the probability that a professional baseball player is left-handed.
Determine an exact 90% confidence interval for $p$. Show first that the equations to
be solved are:


$$\sum_{j=0}^{14}\binom{n}{j}\underline{\theta}^j(1-\underline{\theta})^{n-j} = 0.95 \quad\text{and}\quad \sum_{j=0}^{15}\binom{n}{j}\overline{\theta}^j(1-\overline{\theta})^{n-j} = 0.05.$$

Then do the following steps to obtain the confidence interval.

(a) Show that 0.10 and 0.17 bracket the solution to the first equation.

(b) Show that 0.34 and 0.38 bracket the solution to the second equation.

(c) Then use the R function binomci.r to solve the equations.


4.3.2. In Example 4.3.2, verify the result for the asymptotic confidence interval for
$\theta$.
4.3.3. In Exercise 4.2.20, the large sample confidence interval was obtained for the
probability that a tack tossed on a table lands point up. Find the discrete exact 
confidence interval for this proportion. 


4.3.4. Suppose $X_1, X_2, \ldots, X_{10}$ is a random sample on a random variable $X$ that
has a Poisson distribution with mean $\theta$. Suppose the realized value of the sample
mean is 0.5; i.e., $n\overline{x} = 5$ events occurred. Suppose we want to compute the exact
90% confidence interval for $\theta$, as determined by equations (4.3.7).


(a) Show that 0.19 and 0.20 bracket the solution to the first equation. 
(b) Show that 1.0 and 1.1 bracket the solution to the second equation. 
(c) Then use the R function poissonci.r to solve the equations. 


4.3.5. Consider the same setup as in Example 4.3.1 except now assume that $\sigma^2$
is unknown. Using the distribution of $(\overline{X} - \theta)/(S/\sqrt{n})$, where $S$ is the sample
standard deviation, set up the equations and derive the $t$-interval, (4.2.3), for $\theta$.


4.3.6. Using Exercise 3.3.22, show that


$$\int_0^p \frac{n!}{(k-1)!\,(n-k)!}\,z^{k-1}(1-z)^{n-k}\,dz = \sum_{w=k}^{n}\binom{n}{w}p^w(1-p)^{n-w},$$


where $0 < p < 1$, and $k$ and $n$ are positive integers such that $k \le n$.

Hint: Differentiate both sides with respect to $p$. The derivative of the right side is
a sum of differences. Show it simplifies to the derivative of the left side. Hence, the
sides differ by a constant. Finally, show that the constant is 0.


4.3.7. This exercise obtains a useful identity for the cdf of a Poisson distribution.

(a) Use Exercise 3.3.5 to show that this identity is true:


$$\frac{\lambda^n}{\Gamma(n)}\int_1^{\infty} x^{n-1}e^{-x\lambda}\,dx = \sum_{j=0}^{n-1} e^{-\lambda}\frac{\lambda^j}{j!},$$


for $\lambda > 0$ and $n$ a positive integer.


Hint: Just consider a Poisson process on the unit interval with mean $\lambda$. Let
$W_n$ be the waiting time until the $n$th event. Then the left side is $P(W_n > 1)$.
Why?


(b) Obtain the identity used in Example 4.3.3, by making the transformation
$z = \lambda x$ in the above integral.


4.4 Order Statistics 


In this section the notion of an order statistic is defined and some of its simple
properties are investigated. These statistics have in recent times come to play an
important role in statistical inference partly because some of their properties do not
depend upon the distribution from which the random sample is obtained.

Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution of the continuous
type having a pdf $f(x)$ that has support $S = (a, b)$, where $-\infty \le a < b \le \infty$.
Let $Y_1$ be the smallest of these $X_i$, $Y_2$ the next $X_i$ in order of magnitude, ..., and
$Y_n$ the largest of $X_i$. That is, $Y_1 < Y_2 < \cdots < Y_n$ represent $X_1, X_2, \ldots, X_n$ when
the latter are arranged in ascending order of magnitude. We call $Y_i$, $i = 1, 2, \ldots, n$,
the $i$th order statistic of the random sample $X_1, X_2, \ldots, X_n$. Then the joint pdf
of $Y_1, Y_2, \ldots, Y_n$ is given in the following theorem.


Theorem 4.4.1. Using the above notation, let $Y_1 < Y_2 < \cdots < Y_n$ denote the
$n$ order statistics based on the random sample $X_1, X_2, \ldots, X_n$ from a continuous
distribution with pdf $f(x)$ and support $(a, b)$. Then the joint pdf of $Y_1, Y_2, \ldots, Y_n$ is
given by


$$g(y_1, y_2, \ldots, y_n) = \begin{cases} n!\,f(y_1)f(y_2)\cdots f(y_n) & a < y_1 < y_2 < \cdots < y_n < b \\ 0 & \text{elsewhere.} \end{cases} \tag{4.4.1}$$

Proof: Note that the support of $X_1, X_2, \ldots, X_n$ can be partitioned into $n!$ mutually
disjoint sets that map onto the support of $Y_1, Y_2, \ldots, Y_n$, namely, $\{(y_1, y_2, \ldots, y_n) :
a < y_1 < y_2 < \cdots < y_n < b\}$. One of these $n!$ sets is $a < x_1 < x_2 < \cdots < x_n < b$,
and the others can be found by permuting the $n$ $x$s in all possible ways. The
transformation associated with the one listed is $x_1 = y_1, x_2 = y_2, \ldots, x_n = y_n$,
which has a Jacobian equal to 1. However, the Jacobian of each of the other
transformations is either $\pm 1$. Thus


$$\begin{aligned}
g(y_1, y_2, \ldots, y_n) &= \sum_{i=1}^{n!} |J_i|\, f(y_1)f(y_2)\cdots f(y_n) \\
&= \begin{cases} n!\,f(y_1)f(y_2)\cdots f(y_n) & a < y_1 < y_2 < \cdots < y_n < b \\ 0 & \text{elsewhere,} \end{cases}
\end{aligned}$$


as was to be proved. ■


Example 4.4.1. Let $X$ denote a random variable of the continuous type with a pdf
$f(x)$ that is positive and continuous, with support $S = (a, b)$, $-\infty \le a < b \le \infty$.
The distribution function $F(x)$ of $X$ may be written


$$F(x) = \int_a^x f(w)\,dw, \quad a < x < b.$$


If $x \le a$, $F(x) = 0$; and if $b \le x$, $F(x) = 1$. Thus there is a unique median $m$ of the
distribution with $F(m) = \frac{1}{2}$. Let $X_1, X_2, X_3$ denote a random sample from this
distribution and let $Y_1 < Y_2 < Y_3$ denote the order statistics of the sample. Note
that $Y_2$ is the sample median. We compute the probability that $Y_2 \le m$. The joint
pdf of the three order statistics is


$$g(y_1, y_2, y_3) = \begin{cases} 6 f(y_1)f(y_2)f(y_3) & a < y_1 < y_2 < y_3 < b \\ 0 & \text{elsewhere.} \end{cases}$$

The pdf of $Y_2$ is then


$$\begin{aligned}
h(y_2) &= 6 f(y_2)\int_{y_2}^{b}\int_a^{y_2} f(y_1)f(y_3)\,dy_1\,dy_3 \\
&= \begin{cases} 6 f(y_2)F(y_2)[1 - F(y_2)] & a < y_2 < b \\ 0 & \text{elsewhere.} \end{cases}
\end{aligned}$$

Accordingly,

$$\begin{aligned}
P(Y_2 \le m) &= 6\int_a^m \{F(y_2)f(y_2) - [F(y_2)]^2 f(y_2)\}\,dy_2 \\
&= 6\left\{\frac{[F(y_2)]^2}{2} - \frac{[F(y_2)]^3}{3}\right\}_a^m = \frac{1}{2}.
\end{aligned}$$


Hence, for this situation, the median of the sample median $Y_2$ is the population
median $m$. ■
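
The result $P(Y_2 \le m) = 1/2$ holds for any continuous distribution satisfying the conditions of the example, so it can be checked numerically for a particular choice. The sketch below uses the standard exponential distribution, which is our illustrative choice and not one made in the text; it evaluates the integral of $h(y_2)$ numerically and also estimates the probability by simulation.

```r
# Check P(Y2 <= m) = 1/2 for the median of a sample of size 3.
# The exponential distribution is an illustrative choice, not one made in the text.
h <- function(y) 6 * dexp(y) * pexp(y) * (1 - pexp(y))  # pdf of Y2 when X ~ Exp(1)
m <- qexp(0.5)                                           # population median, log(2)
p_int <- integrate(h, lower = 0, upper = m)$value        # numerical integral of h over (0, m)

# The same probability estimated by simulation.
set.seed(1)
y2 <- replicate(20000, median(rexp(3)))                  # sample medians of size-3 samples
p_sim <- mean(y2 <= m)
c(p_int, p_sim)
```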


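Once it is observed that


$$\int_a^x [F(w)]^{\alpha-1} f(w)\,dw = \frac{[F(x)]^{\alpha}}{\alpha}, \quad \alpha > 0,$$


and that


$$\int_y^b [1 - F(w)]^{\beta-1} f(w)\,dw = \frac{[1 - F(y)]^{\beta}}{\beta}, \quad \beta > 0,$$


it is easy to express the marginal pdf of any order statistic, say $Y_k$, in terms of $F(x)$
and $f(x)$. This is done by evaluating the integral


$$g_k(y_k) = \int_a^{y_k}\cdots\int_a^{y_2}\int_{y_k}^{b}\cdots\int_{y_{n-1}}^{b} n!\,f(y_1)f(y_2)\cdots f(y_n)\,dy_n\cdots dy_{k+1}\,dy_1\cdots dy_{k-1}.$$


The result is


$$g_k(y_k) = \begin{cases} \dfrac{n!}{(k-1)!\,(n-k)!}[F(y_k)]^{k-1}[1 - F(y_k)]^{n-k} f(y_k) & a < y_k < b \\ 0 & \text{elsewhere.} \end{cases} \tag{4.4.2}$$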


Example 4.4.2. Let $Y_1 < Y_2 < Y_3 < Y_4$ denote the order statistics of a random
sample of size 4 from a distribution having pdf


$$f(x) = \begin{cases} 2x & 0 < x < 1 \\ 0 & \text{elsewhere.} \end{cases}$$


We express the pdf of $Y_3$ in terms of $f(x)$ and $F(x)$ and then compute $P(\frac{1}{2} < Y_3)$.
Here $F(x) = x^2$, provided that $0 < x < 1$, so that


$$g_3(y_3) = \begin{cases} \dfrac{4!}{2!\,1!}(y_3^2)^2(1 - y_3^2)(2y_3) & 0 < y_3 < 1 \\ 0 & \text{elsewhere.} \end{cases}$$

Thus


$$P\left(\tfrac{1}{2} < Y_3\right) = \int_{1/2}^{\infty} g_3(y_3)\,dy_3 = \int_{1/2}^{1} 24(y_3^5 - y_3^7)\,dy_3 = \frac{243}{256}.$$

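Finally, the joint pdf of any two order statistics, say $Y_i < Y_j$, is easily expressed
in terms of $F(x)$ and $f(x)$. We have


$$\begin{aligned}
g_{ij}(y_i, y_j) = {} & \int_a^{y_i}\cdots\int_a^{y_2}\int_{y_i}^{y_j}\cdots\int_{y_{j-2}}^{y_j}\int_{y_j}^{b}\cdots\int_{y_{n-1}}^{b} n!\,f(y_1)\times\cdots \\
& \times f(y_n)\,dy_n\cdots dy_{j+1}\,dy_{j-1}\cdots dy_{i+1}\,dy_1\cdots dy_{i-1}.
\end{aligned}$$


Since, for $\gamma > 0$,


$$\int_x^y [F(y) - F(w)]^{\gamma-1} f(w)\,dw = \frac{[F(y) - F(x)]^{\gamma}}{\gamma},$$


it is found that


$$g_{ij}(y_i, y_j) = \begin{cases} \dfrac{n!}{(i-1)!\,(j-i-1)!\,(n-j)!}[F(y_i)]^{i-1}[F(y_j) - F(y_i)]^{j-i-1} & \\ \quad\times\,[1 - F(y_j)]^{n-j} f(y_i) f(y_j) & a < y_i < y_j < b \\ 0 & \text{elsewhere.} \end{cases} \tag{4.4.3}$$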


Remark 4.4.1 (Heuristic Derivation). There is an easy method of remembering
the pdf of a vector of order statistics such as the one given in formula (4.4.3). The
probability $P(y_i < Y_i < y_i + \Delta_i,\ y_j < Y_j < y_j + \Delta_j)$, where $\Delta_i$ and $\Delta_j$ are small,
can be approximated by the following multinomial probability. In $n$ independent
trials, $i - 1$ outcomes must be less than $y_i$ [an event that has probability $p_1 = F(y_i)$
on each trial]; $j - i - 1$ outcomes must be between $y_i + \Delta_i$ and $y_j$ [an event with
approximate probability $p_2 = F(y_j) - F(y_i)$ on each trial]; $n - j$ outcomes must be
greater than $y_j + \Delta_j$ [an event with approximate probability $p_3 = 1 - F(y_j)$ on each
trial]; one outcome must be between $y_i$ and $y_i + \Delta_i$ [an event with approximate
probability $p_4 = f(y_i)\Delta_i$ on each trial]; and, finally, one outcome must be between
$y_j$ and $y_j + \Delta_j$ [an event with approximate probability $p_5 = f(y_j)\Delta_j$ on each trial].
This multinomial probability is


$$\frac{n!}{(i-1)!\,(j-i-1)!\,(n-j)!\,1!\,1!}\;p_1^{i-1}p_2^{j-i-1}p_3^{n-j}p_4 p_5,$$


which is $g_{i,j}(y_i, y_j)\Delta_i\Delta_j$, where $g_{i,j}(y_i, y_j)$ is given in expression (4.4.3). ■
Certain functions of the order statistics $Y_1, Y_2, \ldots, Y_n$ are important statistics
themselves. The sample range of the random sample is given by $Y_n - Y_1$ and the
sample midrange is given by $(Y_1 + Y_n)/2$. The sample median of the random
sample is defined by


$$Q_2 = \begin{cases} Y_{(n+1)/2} & \text{if } n \text{ is odd} \\ (Y_{n/2} + Y_{(n/2)+1})/2 & \text{if } n \text{ is even.} \end{cases} \tag{4.4.4}$$


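Example 4.4.3. Let $Y_1, Y_2, Y_3$ be the order statistics of a random sample of size
3 from a distribution having pdf


$$f(x) = \begin{cases} 1 & 0 < x < 1 \\ 0 & \text{elsewhere.} \end{cases}$$


We seek the pdf of the sample range $Z_1 = Y_3 - Y_1$. Since $F(x) = x$, $0 < x < 1$, the
joint pdf of $Y_1$ and $Y_3$ is


$$g_{13}(y_1, y_3) = \begin{cases} 6(y_3 - y_1) & 0 < y_1 < y_3 < 1 \\ 0 & \text{elsewhere.} \end{cases}$$


In addition to $Z_1 = Y_3 - Y_1$, let $Z_2 = Y_3$. The functions $z_1 = y_3 - y_1$, $z_2 = y_3$ have
respective inverses $y_1 = z_2 - z_1$, $y_3 = z_2$, so that the corresponding Jacobian of the
one-to-one transformation is


$$J = \begin{vmatrix} \dfrac{\partial y_1}{\partial z_1} & \dfrac{\partial y_1}{\partial z_2} \\ \dfrac{\partial y_3}{\partial z_1} & \dfrac{\partial y_3}{\partial z_2} \end{vmatrix} = \begin{vmatrix} -1 & 1 \\ 0 & 1 \end{vmatrix} = -1.$$


Thus the joint pdf of $Z_1$ and $Z_2$ is


$$h(z_1, z_2) = \begin{cases} |-1|\,6z_1 = 6z_1 & 0 < z_1 < z_2 < 1 \\ 0 & \text{elsewhere.} \end{cases}$$


Accordingly, the pdf of the range $Z_1 = Y_3 - Y_1$ of the random sample of size 3 is


$$h_1(z_1) = \begin{cases} \int_{z_1}^{1} 6z_1\,dz_2 = 6z_1(1 - z_1) & 0 < z_1 < 1 \\ 0 & \text{elsewhere.} \end{cases}$$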

4.4.1 Quantiles 


Let $X$ be a random variable with a continuous cdf $F(x)$. For $0 < p < 1$, define
the pth quantile of $X$ to be $\xi_p = F^{-1}(p)$. For example, $\xi_{0.5}$, the median of $X$, is
the 0.5 quantile. Let $X_1, X_2, \ldots, X_n$ be a random sample from the distribution of
$X$ and let $Y_1 < Y_2 < \cdots < Y_n$ be the corresponding order statistics. Let $k = [p(n+1)]$
be the greatest integer less than or equal to $p(n+1)$. We next define an estimator of
$\xi_p$ after making the following observation. The area under the pdf $f(x)$ to the left of
$Y_k$ is $F(Y_k)$. The expected value of this area is

$$E(F(Y_k)) = \int_a^b F(y_k)\, g_k(y_k)\, dy_k,$$

where $g_k(y_k)$ is the pdf of $Y_k$ given in expression (4.4.2). If, in this integral, we
make a change of variables through the transformation $z = F(y_k)$, we have

$$E(F(Y_k)) = \int_0^1 \frac{n!}{(k-1)!\,(n-k)!}\, z^k (1-z)^{n-k}\, dz.$$

Comparing this to the integral of a beta pdf, we see that it is equal to

$$E(F(Y_k)) = \frac{n!\,k!\,(n-k)!}{(k-1)!\,(n-k)!\,(n+1)!} = \frac{k}{n+1}.$$

On the average, there is $k/(n+1)$ of the total area to the left of $Y_k$. Because
$p \approx k/(n+1)$, it seems reasonable to take $Y_k$ as an estimator of the quantile $\xi_p$.
Hence, we call $Y_k$ the pth sample quantile. It is also called the 100pth percentile
of the sample.
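The identity $E(F(Y_k)) = k/(n+1)$ can be illustrated numerically. The following sketch
assumes a standard normal population and the values $n = 10$ and $k = 3$; these choices,
like the seed, are ours and are made only for illustration.

```r
set.seed(441)                      # arbitrary seed
n <- 10; k <- 3; nsim <- 10000     # assumed illustrative values

# For each simulated N(0,1) sample, evaluate the cdf at the kth order statistic
area_left <- replicate(nsim, pnorm(sort(rnorm(n))[k]))

# The simulated mean area should be close to k/(n + 1)
c(simulated = mean(area_left), theory = k / (n + 1))
```

Because $F(Y_k)$ does not depend on the form of a continuous $F$, replacing `rnorm` and
`pnorm` by another matching pair (for example `rexp` and `pexp`) should give a similar result.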


Remark 4.4.2. Some statisticians define sample quantiles slightly differently from
what we have. For one modification with $1/(n+1) < p < n/(n+1)$, if $(n+1)p$
is not equal to an integer, then the pth quantile of the sample may be defined as
follows. Write $(n+1)p = k + r$, where $k = [(n+1)p]$ and $r$ is a proper fraction.
Then the pth quantile of the sample is the weighted average

$$(1 - r)Y_k + rY_{k+1}, \tag{4.4.5}$$

which is an estimator of the pth quantile. As $n$ becomes large, however, all these
modified definitions are essentially the same. For R code, let the R vector x contain
the realization of the sample. Then the call quantile(x,p) computes a pth quantile
of form (4.4.5).
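The weighted average (4.4.5) is easy to compute by hand in R. In the sketch below the
data vector is hypothetical. As far as we are aware from R's documentation of
quantile, the argument type = 6 uses the weighting $(n+1)p = k + r$ of (4.4.5), while
the default (type = 7) is a weighted average of the same form with a slightly different
weighting, so the two calls need not agree for small $n$.

```r
x <- c(2, 5, 7, 8, 11, 13, 17)   # hypothetical data, n = 7
p <- 0.3
y <- sort(x); n <- length(y)

k <- floor((n + 1) * p)          # k = [(n + 1)p]
r <- (n + 1) * p - k             # fractional part r
manual <- (1 - r) * y[k] + r * y[k + 1]   # expression (4.4.5)

c(manual  = manual,
  type6   = unname(quantile(x, p, type = 6)),
  default = unname(quantile(x, p)))
```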


Sample quantiles are useful descriptive statistics. For instance, if $y_k$ is the pth
quantile of the realized sample, then we know that approximately $100p\%$ of the data
are less than or equal to $y_k$, and approximately $100(1-p)\%$ of the data are greater
than or equal to $y_k$. Next we discuss two statistical applications of quantiles.

A five-number summary of the data consists of the following five sample quantiles:
the minimum ($Y_1$), the first quartile ($Y_{.25(n+1)}$), the median defined in
expression (4.4.4), the third quartile ($Y_{.75(n+1)}$), and the maximum ($Y_n$). For this
section, we use the notation $Q_1$, $Q_2$, and $Q_3$ to denote, respectively, the first
quartile, median, and third quartile of the sample.

The five-number summary divides the data into their quartiles, offering a simple
and easily interpretable description of the data. Five-number summaries were
made popular by the work of the late Professor John Tukey [see Tukey (1977) and
Mosteller and Tukey (1977)]. Tukey used the median of the lower half of the data
(from minimum to median) and the median of the upper half of the data instead
of the first and third quartiles. He referred to these quantities as the hinges of
the data. The R function fivenum(x) returns the hinges along with the minimum,
median, and maximum of the data.


Example 4.4.4. The following data are the ordered realizations of a random sample
of size 15 on a random variable X.

 56   70   89   94   96  101  102  102
102  105  106  108  110  113  116

For these data, since $n + 1 = 16$, the realizations of the five-number summary are
$y_1 = 56$, $Q_1 = y_4 = 94$, $Q_2 = y_8 = 102$, $Q_3 = y_{12} = 108$, and $y_{15} = 116$. Hence,
based on the five-number summary, the data range from 56 to 116; the middle 50%
of the data range from 94 to 108; and the middle of the data occurred at 102. The
data are in the file eg4.4.4data.rda. ■
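The five-number summary of Example 4.4.4 can be reproduced in base R. The sketch below
enters the data by hand rather than loading eg4.4.4data.rda, and it computes both the
summary as defined in the text and Tukey's hinge-based version.

```r
x <- c(56, 70, 89, 94, 96, 101, 102, 102, 102, 105, 106, 108, 110, 113, 116)
y <- sort(x); n <- length(y)

# Five-number summary as defined in the text: order statistics at .25(n+1) and .75(n+1)
five_text <- c(y[1], y[0.25 * (n + 1)], median(y), y[0.75 * (n + 1)], y[n])
five_text

# Tukey's version, which uses the hinges in place of the quartiles
fivenum(x)
```

For these data the two versions differ only slightly: the hinges are 95 and 107, whereas
the quartiles of the text are 94 and 108.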


The five-number summary is the basis for a useful and quick plot of the data.
This is called a boxplot of the data. The box encloses the middle 50% of the
data and a line segment is usually used to indicate the median. The extreme order
statistics, however, are very sensitive to outlying points. So care must be used in
placing these on the plot. We make use of the box and whisker plots defined by
John Tukey. In order to define this plot, we need to define a potential outlier. Let
$h = 1.5(Q_3 - Q_1)$ and define the lower fence (LF) and the upper fence (UF) by

$$LF = Q_1 - h \quad \text{and} \quad UF = Q_3 + h. \tag{4.4.6}$$

Points that lie outside the fences, i.e., outside the interval (LF, UF), are called
potential outliers and they are denoted by the symbol “O” on the boxplot. The
whiskers then protrude from the sides of the box to what are called the adjacent
points, which are the points within the fences but closest to the fences. Exercise
4.4.2 shows that the probability of an observation from a normal distribution being
a potential outlier is 0.006977.
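The normal probability quoted above can be checked numerically. The following sketch
applies the fence definition (4.4.6) to the population quartiles of the $N(0,1)$
distribution; it is a numerical check only, not the derivation requested in Exercise 4.4.2.

```r
q1 <- qnorm(0.25)          # first quartile of N(0,1); by symmetry Q3 = -q1
h  <- 1.5 * (-q1 - q1)     # h = 1.5(Q3 - Q1)
lf <- q1 - h               # lower fence
uf <- -q1 + h              # upper fence

# P(X < LF) + P(X > UF) for X distributed N(0,1)
p_out <- pnorm(lf) + (1 - pnorm(uf))
p_out
```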


Example 4.4.5 (Example 4.4.4, Continued). Consider the data given in Example
4.4.4. For these data, $h = 1.5(108 - 94) = 21$, $LF = 73$, and $UF = 129$. Hence the
observations 56 and 70 are potential outliers. There are no outliers on the high side
of the data. The lower adjacent point is 89. The boxplot of the data set is given in
Panel A of Figure 4.4.1, which was computed by the R segment boxplot(x) where
the R vector x contains the data.

Note that the point 56 is over $2h$ from $Q_1$. Some statisticians call such a point
an “outlier” and label it with a symbol other than “O,” but we do not make this
distinction.
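The fence calculations of Example 4.4.5 can be sketched in R as follows. The quartiles
94 and 108 are taken from Example 4.4.4. Note, as an assumption about base R worth
checking in its documentation, that boxplot and boxplot.stats base their fences on the
hinges rather than on these quartiles; for these data the flagged points are the same.

```r
x <- c(56, 70, 89, 94, 96, 101, 102, 102, 102, 105, 106, 108, 110, 113, 116)
q1 <- 94; q3 <- 108              # quartiles from Example 4.4.4

h  <- 1.5 * (q3 - q1)            # h = 21
lf <- q1 - h                     # lower fence, 73
uf <- q3 + h                     # upper fence, 129

outliers <- x[x < lf | x > uf]              # potential outliers
adjacent <- range(x[x >= lf & x <= uf])     # lower and upper adjacent points

list(h = h, LF = lf, UF = uf, outliers = outliers, adjacent = adjacent)

# Hinge-based version used by R's boxplot machinery
boxplot.stats(x)$out
```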


In practice, we often assume that the data follow a certain distribution. For
example, we may assume that $X_1, \ldots, X_n$ are a random sample from a normal
distribution with unknown mean and variance. Thus the form of the distribution
of $X$ is known, but the specific parameters are not. Such an assumption needs to
be checked and there are many statistical tests which do so; see D’Agostino and
Stephens (1986) for a thorough discussion of such tests. As our second statistical
application of quantiles, we discuss one such diagnostic plot in this regard.

Figure 4.4.1: Boxplot and quantile plots for the data of Example 4.4.4.

We consider the location and scale family. Suppose $X$ is a random variable
with cdf $F((x - a)/b)$, where $F(x)$ is known but $a$ and $b > 0$ may not be. Let
$Z = (X - a)/b$; then $Z$ has cdf $F(z)$. Let $0 < p < 1$ and let $\xi_{X,p}$ be the pth quantile
of $X$. Let $\xi_{Z,p}$ be the pth quantile of $Z = (X - a)/b$. Because $F(z)$ is known, $\xi_{Z,p}$
is known. But

$$p = P[X \le \xi_{X,p}] = P\left[Z \le \frac{\xi_{X,p} - a}{b}\right],$$

from which we have the linear relationship

$$\xi_{X,p} = b\,\xi_{Z,p} + a. \tag{4.4.7}$$


Thus, if $X$ has a cdf of the form of $F((x - a)/b)$, then the quantiles of $X$ are
linearly related to the quantiles of $Z$. Of course, in practice, we do not know the
quantiles of $X$, but we can estimate them. Let $X_1, \ldots, X_n$ be a random sample from
the distribution of $X$ and let $Y_1 < \cdots < Y_n$ be the order statistics. For $k = 1, \ldots, n$,
let $p_k = k/(n+1)$. Then $Y_k$ is an estimator of $\xi_{X,p_k}$. Denote the corresponding
quantiles of the cdf $F(z)$ by $\xi_{Z,p_k} = F^{-1}(p_k)$. Let $y_k$ denote the realized value of
$Y_k$. The plot of $y_k$ versus $\xi_{Z,p_k}$ is called a q-q plot, as it plots one set of quantiles
from the sample against another set from the theoretical cdf $F(z)$. Based on the
above discussion, the linearity of such a plot indicates that the cdf of $X$ is of the
form $F((x - a)/b)$.
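The construction of a q-q plot can be sketched in a few lines of R. The example below
uses simulated normal data with assumed values $a = 10$ and $b = 2$ (our choices, not
from the text) and plots nothing by default; it reports the least squares line through
the points $(\xi_{Z,p_k}, y_k)$, whose intercept and slope should be roughly $a$ and $b$
by the linear relationship between the quantiles.

```r
set.seed(447)                         # arbitrary seed
n <- 200
x <- rnorm(n, mean = 10, sd = 2)      # simulated data; a = 10 and b = 2 are assumptions

y  <- sort(x)                         # realized order statistics y_k
pk <- (1:n) / (n + 1)                 # p_k = k/(n + 1)
xi <- qnorm(pk)                       # standard normal quantiles at p_k

fit <- lm(y ~ xi)                     # line through the q-q points
coef(fit)                             # intercept near a, slope near b
cor(xi, y)                            # near 1 when the plot is close to linear

# To draw the plot itself: plot(xi, y); abline(fit)
```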


Example 4.4.6 (Example 4.4.5, Continued). Panels B, C, and D of Figure 4.4.1
contain q-q plots of the data of Example 4.4.4 for three different distributions.
The quantiles of a standard normal random variable are used for the plot in Panel
B. Hence, as described above, this is the plot of $y_k$ versus $\Phi^{-1}(k/(n+1))$, for
$k = 1, 2, \ldots, n$. For Panel C, the population quantiles of the standard Laplace
distribution are used; that is, the density of $Z$ is $f(z) = (1/2)e^{-|z|}$, $-\infty < z < \infty$.
For Panel D, the quantiles were generated from an exponential distribution with
density $f(z) = e^{-z}$, $0 < z < \infty$, zero elsewhere. The generation of these quantiles
is discussed in Exercise 4.4.1.

The plot farthest from linearity is that of Panel D. Note that this plot gives
an indication of a more correct distribution. For the points to lie on a line, the
lower quantiles of $Z$ must be spread out as are the higher quantiles; i.e., symmetric
distributions may be more appropriate. The plots in Panels B and C are more
linear than that of Panel D, but they still contain some curvature. Of the two,
Panel C appears to be more linear. Actually, the data were generated from a
Laplace distribution, so one would expect that Panel C would be the most linear of
the three plots.

Many computer packages have commands to obtain the population quantiles
used in this example. The R function qqplotc4s2.r, at the site listed in Chapter
1, obtains the normal, Laplace, and exponential quantiles used for Figure 4.4.1 and
the plot. The call is qqplotc4s2(x) where the R vector x contains the data. ■
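For readers without access to qqplotc4s2, the three sets of population quantiles can be
sketched in base R. This is not the authors' function. The helper qlaplace below is a
hypothetical name of ours; its formula comes from inverting the standard Laplace cdf,
which is the subject of Exercise 4.4.1. The correlations at the end are only a rough
numerical summary of how linear each q-q plot is.

```r
x <- c(56, 70, 89, 94, 96, 101, 102, 102, 102, 105, 106, 108, 110, 113, 116)
y <- sort(x); n <- length(y)
pk <- (1:n) / (n + 1)

# Hypothetical helper: quantile function of the standard Laplace distribution
qlaplace <- function(p) ifelse(p < 0.5, log(2 * p), -log(2 * (1 - p)))

q_norm <- qnorm(pk)        # Panel B: standard normal quantiles
q_lap  <- qlaplace(pk)     # Panel C: standard Laplace quantiles
q_exp  <- qexp(pk)         # Panel D: exponential quantiles, -log(1 - p)

# Correlation between sample and population quantiles for each candidate
sapply(list(normal = q_norm, laplace = q_lap, exponential = q_exp),
       function(q) cor(q, y))

# To draw one panel: plot(q_lap, y)
```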


The q-q plot using normal quantiles is often called a normal q-q plot. If the
data are in the R vector x, the plot is obtained by the call qqnorm(x).


4.4.2 Confidence Intervals for Quantiles 


Let $X$ be a continuous random variable with cdf $F(x)$. For $0 < p < 1$, define the
100pth distribution percentile to be $\xi_p$, where $F(\xi_p) = p$. For a sample of size $n$ on
$X$, let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics. Let $k = [(n+1)p]$. Then the
100pth sample percentile $Y_k$ is a point estimate of $\xi_p$.

We now derive a distribution free confidence interval for $\xi_p$, meaning it is a
confidence interval for $\xi_p$ which is free of any assumptions about $F(x)$ other than
it is of the continuous type. Let $i < [(n+1)p] < j$, and consider the order statistics
$Y_i < Y_j$ and the event $Y_i < \xi_p < Y_j$. For the ith order statistic $Y_i$ to be less than
$\xi_p$, it must be true that at least $i$ of the $X$ values are less than $\xi_p$. Moreover, for
the jth order statistic to be greater than $\xi_p$, fewer than $j$ of the $X$ values are less
than $\xi_p$. To put this in the context of a binomial distribution, the probability of
success is $P(X < \xi_p) = F(\xi_p) = p$. Further, the event $Y_i < \xi_p < Y_j$ is equivalent to
obtaining between $i$ (inclusive) and $j$ (exclusive) successes in $n$ independent trials.
Thus, taking probabilities, we have

$$P(Y_i < \xi_p < Y_j) = \sum_{w=i}^{j-1} \binom{n}{w} p^w (1-p)^{n-w}. \tag{4.4.8}$$

When particular values of $n$, $i$, and $j$ are specified, this probability can be computed.
By this procedure, suppose that it has been found that $\gamma = P(Y_i < \xi_p < Y_j)$. Then
the probability is $\gamma$ that the random interval $(Y_i, Y_j)$ includes the quantile of order
$p$. If the experimental values of $Y_i$ and $Y_j$ are, respectively, $y_i$ and $y_j$, the interval
$(y_i, y_j)$ serves as a $100\gamma\%$ confidence interval for $\xi_p$, the quantile of order $p$. We use
this in the next example to find a confidence interval for the median.
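Expression (4.4.8) is a sum of binomial probabilities, so it can be evaluated with
dbinom or, equivalently, as a difference of two pbinom values. The function name
ci_coverage and the values $n = 15$, $i = 5$, $j = 11$, $p = 0.5$ in this sketch are
illustrative choices of ours.

```r
# Coverage probability P(Y_i < xi_p < Y_j) from expression (4.4.8)
# (ci_coverage is a hypothetical helper name)
ci_coverage <- function(n, i, j, p) {
  sum(dbinom(i:(j - 1), n, p))     # sum over w = i, ..., j - 1
}

gam <- ci_coverage(n = 15, i = 5, j = 11, p = 0.5)
gam

# Same quantity as a difference of binomial cdf values
pbinom(10, 15, 0.5) - pbinom(4, 15, 0.5)
```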


Example 4.4.7 (Confidence Interval for the Median). Let $X$ be a continuous random
variable with cdf $F(x)$. Let $\xi_{1/2}$ denote the median of $F(x)$; i.e., $\xi_{1/2}$ solves
$F(\xi_{1/2}) = 1/2$. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from the distribution
of $X$ with corresponding order statistics $Y_1 < Y_2 < \cdots < Y_n$. As before, let $Q_2$
denote the sample median, which is a point estimator of $\xi_{1/2}$. Select $\alpha$, so that
$0 < \alpha < 1$. Take $c_{\alpha/2}$ to be the $\alpha/2$th quantile of a binomial $b(n, 1/2)$ distribution;
that is, $P[S \le c_{\alpha/2}] = \alpha/2$, where $S$ is distributed $b(n, 1/2)$. Then note also that
$P[S \ge n - c_{\alpha/2}] = \alpha/2$. (Because of the discreteness of the binomial distribution,
either take a value of $\alpha$ for which these probabilities are correct or change the
equalities to approximations.) Thus it follows from expression (4.4.8) that

$$P[Y_{c_{\alpha/2}+1} < \xi_{1/2} < Y_{n-c_{\alpha/2}}] = 1 - \alpha. \tag{4.4.9}$$

Hence, when the sample is drawn, if $y_{c_{\alpha/2}+1}$ and $y_{n-c_{\alpha/2}}$ are the realized values of
the order statistics $Y_{c_{\alpha/2}+1}$ and $Y_{n-c_{\alpha/2}}$, then the interval

$$(y_{c_{\alpha/2}+1},\; y_{n-c_{\alpha/2}}) \tag{4.4.10}$$

is a $(1 - \alpha)100\%$ confidence interval for $\xi_{1/2}$.

To illustrate this confidence interval, consider the data of Example 4.4.4. Suppose
we want an 88% confidence interval for $\xi_{1/2}$. Then $\alpha/2 = 0.060$. Then $c_{\alpha/2} = 4$
because $P[S \le 4]$ = pbinom(4,15,.5) = 0.059, where the distribution of $S$ is binomial
with $n = 15$ and $p = 0.5$. Therefore, an 88% confidence interval for $\xi_{1/2}$ is
$(y_5, y_{11}) = (96, 106)$.

The R function onesampsgn(x) computes a confidence interval for the median.
For the data in Example 4.4.4, the code onesampsgn(x,alpha=.12) computes the
confidence interval (96, 106) for the median. ■
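The interval (4.4.10) can also be obtained with base R alone. The sketch below is not
the function onesampsgn; the name median_ci is hypothetical. It takes $c_{\alpha/2}$ to be
the largest integer $c$ with $P[S \le c] \le \alpha/2$, which is one conservative way of
handling the discreteness, and it reports the confidence level actually achieved.

```r
# Distribution free confidence interval for the median (hypothetical helper)
median_ci <- function(x, alpha) {
  y <- sort(x); n <- length(y)
  cdf <- pbinom(0:n, n, 0.5)              # P[S <= c] for c = 0, ..., n
  ok <- which(cdf <= alpha / 2)
  if (length(ok) == 0) stop("sample too small for this confidence level")
  cc <- max(ok) - 1                       # largest c with P[S <= c] <= alpha/2
  list(interval = c(y[cc + 1], y[n - cc]),
       conf     = 1 - 2 * pbinom(cc, n, 0.5))
}

x <- c(56, 70, 89, 94, 96, 101, 102, 102, 102, 105, 106, 108, 110, 113, 116)
res <- median_ci(x, alpha = 0.12)
res
```

For the data of Example 4.4.4 this returns the interval (96, 106), with an achieved
confidence of about 0.88.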


Note that because of the discreteness of the binomial distribution, only certain
confidence levels are possible for this confidence interval for the median. If we further
assume that $f(x)$ is symmetric about $\xi_{1/2}$, Chapter 10 presents other distribution free
confidence intervals where this discreteness is much less of a problem.


EXERCISES 


4.4.1. Obtain closed-form expressions for the distribution quantiles based on the 
exponential and Laplace distributions as discussed in Example 4.4.6. 


4.4.2. Suppose the pdf $f(x)$ is symmetric about 0 with cdf $F(x)$. Show that the
probability of a potential outlier from this distribution is $2F(4q_1)$, where
$F^{-1}(0.25) = q_1$. Use this to obtain the probability that an observation is a
potential outlier for the following distributions.


(a) The underlying distribution is normal. Use the N(0,1) distribution.
(b) The underlying distribution is logistic; that is, the pdf is given by

$$f(x) = \frac{e^{-x}}{(1 + e^{-x})^2}, \quad -\infty < x < \infty. \tag{4.4.11}$$

(c) The underlying distribution is Laplace, with the pdf

$$f(x) = \frac{1}{2} e^{-|x|}, \quad -\infty < x < \infty. \tag{4.4.12}$$


4.4.3. Consider the sample of data (data are in the file ex4.4.3data.rda):


 13    5  202   15   99    4   67   83   36   11  301
 23  213   40   66  106   78   69  166   84   64


(a) Obtain the five-number summary of these data. 
(b) Determine if there are any outliers. 


(c) Boxplot the data. Comment on the plot. 


4.4.4. Consider the data in Exercise 4.4.3. Obtain the normal q-q plot for these
data. Does the plot suggest that the underlying distribution is normal? If not, use
the plot to determine a more appropriate distribution. Confirm your choice with a
q-q plot based on the quantiles of your chosen distribution.


4.4.5. Let $Y_1 < Y_2 < Y_3 < Y_4$ be the order statistics of a random sample of size
4 from the distribution having pdf $f(x) = e^{-x}$, $0 < x < \infty$, zero elsewhere. Find
$P(Y_4 > 8)$.


4.4.6. Let $X_1, X_2, X_3$ be a random sample from a distribution of the continuous
type having pdf $f(x) = 2x$, $0 < x < 1$, zero elsewhere.


(a) Compute the probability that the smallest of $X_1, X_2, X_3$ exceeds the median
of the distribution.


(b) If $Y_1 < Y_2 < Y_3$ are the order statistics, find the correlation between $Y_2$ and
$Y_3$.


4.4.7. Let $f(x) = \frac{1}{6}$, $x = 1, 2, 3, 4, 5, 6$, zero elsewhere, be the pmf of a distribution
of the discrete type. Show that the pmf of the smallest observation of a random
sample of size 5 from this distribution is

$$g_1(y_1) = \left(\frac{7 - y_1}{6}\right)^5 - \left(\frac{6 - y_1}{6}\right)^5, \quad y_1 = 1, 2, \ldots, 6,$$

zero elsewhere. Note that in this exercise the random sample is from a distribution
of the discrete type. All formulas in the text were derived under the assumption
that the random sample is from a distribution of the continuous type and are not
applicable. Why?


4.4.8. Let $Y_1 < Y_2 < Y_3 < Y_4 < Y_5$ denote the order statistics of a random sample
of size 5 from a distribution having pdf $f(x) = e^{-x}$, $0 < x < \infty$, zero elsewhere.
Show that $Z_1 = Y_2$ and $Z_2 = Y_4 - Y_2$ are independent.

Hint: First find the joint pdf of $Y_2$ and $Y_4$.


4.4.9. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample of size $n$
from a distribution with pdf $f(x) = 1$, $0 < x < 1$, zero elsewhere. Show that the
kth order statistic $Y_k$ has a beta pdf with parameters $\alpha = k$ and $\beta = n - k + 1$.




4.4.10. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics from a Weibull distribution,
Exercise 3.3.26. Find the distribution function and pdf of $Y_1$.


4.4.11. Find the probability that the range of a random sample of size 4 from the
uniform distribution having the pdf $f(x) = 1$, $0 < x < 1$, zero elsewhere, is less
than $\frac{1}{2}$.


4.4.12. Let $Y_1 < Y_2 < Y_3$ be the order statistics of a random sample of size 3 from
a distribution having the pdf $f(x) = 2x$, $0 < x < 1$, zero elsewhere. Show that
$Z_1 = Y_1/Y_2$, $Z_2 = Y_2/Y_3$, and $Z_3 = Y_3$ are mutually independent.


4.4.13. Suppose a random sample of size 2 is obtained from a distribution that has
pdf $f(x) = 2(1 - x)$, $0 < x < 1$, zero elsewhere. Compute the probability that one
sample observation is at least twice as large as the other.


4.4.14. Let $Y_1 < Y_2 < Y_3$ denote the order statistics of a random sample of size
3 from a distribution with pdf $f(x) = 1$, $0 < x < 1$, zero elsewhere. Let
$Z = (Y_1 + Y_3)/2$ be the midrange of the sample. Find the pdf of $Z$.


4.4.15. Let $Y_1 < Y_2$ denote the order statistics of a random sample of size 2 from
$N(0, \sigma^2)$.


(a) Show that $E(Y_1) = -\sigma/\sqrt{\pi}$.
Hint: Evaluate $E(Y_1)$ by using the joint pdf of $Y_1$ and $Y_2$ and first integrating
on $y_1$.


(b) Find the covariance of $Y_1$ and $Y_2$.


4.4.16. Let $Y_1 < Y_2$ be the order statistics of a random sample of size 2 from a
distribution of the continuous type which has pdf $f(x)$ such that $f(x) > 0$, provided
that $x \ge 0$, and $f(x) = 0$ elsewhere. Show that the independence of $Z_1 = Y_1$ and
$Z_2 = Y_2 - Y_1$ characterizes the gamma pdf $f(x)$, which has parameters $\alpha = 1$ and
$\beta > 0$. That is, show that $Z_1$ and $Z_2$ are independent if and only if $f(x)$ is the pdf
of a $\Gamma(1, \beta)$ distribution.

Hint: Use the change-of-variable technique to find the joint pdf of $Z_1$ and $Z_2$ from
that of $Y_1$ and $Y_2$. Accept the fact that the functional equation $h(0)h(x + y) =
h(x)h(y)$ has the solution $h(x) = c_1 e^{c_2 x}$, where $c_1$ and $c_2$ are constants.


4.4.17. Let $Y_1 < Y_2 < Y_3 < Y_4$ be the order statistics of a random sample of size
$n = 4$ from a distribution with pdf $f(x) = 2x$, $0 < x < 1$, zero elsewhere.


(a) Find the joint pdf of $Y_3$ and $Y_4$.
(b) Find the conditional pdf of $Y_3$, given $Y_4 = y_4$.
(c) Evaluate $E(Y_3 | y_4)$.


4.4.18. Two numbers are selected at random from the interval (0,1). If these 
values are uniformly and independently distributed, by cutting the interval at these 
numbers, compute the probability that the three resulting line segments can form 
a triangle.


4.4.19. Let $X$ and $Y$ denote independent random variables with respective probability
density functions $f(x) = 2x$, $0 < x < 1$, zero elsewhere, and $g(y) = 3y^2$,
$0 < y < 1$, zero elsewhere. Let $U = \min(X, Y)$ and $V = \max(X, Y)$. Find the joint pdf
of $U$ and $V$.

Hint: Here the two inverse transformations are given by $x = u$, $y = v$ and
$x = v$, $y = u$.


4.4.20. Let the joint pdf of $X$ and $Y$ be $f(x, y) = \frac{12}{7} x(x + y)$, $0 < x < 1$, $0 < y < 1$,
zero elsewhere. Let $U = \min(X, Y)$ and $V = \max(X, Y)$. Find the joint pdf of $U$
and $V$.


4.4.21. Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution of either type.
A measure of spread is Gini’s mean difference

$$G = \sum_{j=2}^{n} \sum_{i=1}^{j-1} |X_i - X_j| \Big/ \binom{n}{2}. \tag{4.4.13}$$


(a) If $n = 10$, find $a_1, a_2, \ldots, a_{10}$ so that $G = \sum_{i=1}^{10} a_i Y_i$, where $Y_1, Y_2, \ldots, Y_{10}$ are
the order statistics of the sample.


(b) Show that $E(G) = 2\sigma/\sqrt{\pi}$ if the sample arises from the normal distribution
$N(\mu, \sigma^2)$.


4.4.22. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample of size $n$
from the exponential distribution with pdf $f(x) = e^{-x}$, $0 < x < \infty$, zero elsewhere.


(a) Show that $Z_1 = nY_1$, $Z_2 = (n-1)(Y_2 - Y_1)$, $Z_3 = (n-2)(Y_3 - Y_2)$, ..., $Z_n =
Y_n - Y_{n-1}$ are independent and that each $Z_i$ has the exponential distribution.


(b) Demonstrate that all linear functions of $Y_1, Y_2, \ldots, Y_n$, such as $\sum_{i=1}^{n} a_i Y_i$, can
be expressed as linear functions of independent random variables.


4.4.23. In the Program Evaluation and Review Technique (PERT), we are interested
in the total time to complete a project that is comprised of a large number of
subprojects. For illustration, let $X_1, X_2, X_3$ be three independent random times for
three subprojects. If these subprojects are in series (the first one must be completed
before the second starts, etc.), then we are interested in the sum $Y = X_1 + X_2 + X_3$.
If these are in parallel (can be worked on simultaneously), then we are interested in
$Z = \max(X_1, X_2, X_3)$. In the case each of these random variables has the uniform
distribution with pdf $f(x) = 1$, $0 < x < 1$, zero elsewhere, find (a) the pdf of $Y$
and (b) the pdf of $Z$.


4.4.24. Let $Y_n$ denote the nth order statistic of a random sample of size $n$ from
a distribution of the continuous type. Find the smallest value of $n$ for which the
inequality $P(\xi_{0.9} < Y_n) \ge 0.75$ is true.


4.4.25. Let $Y_1 < Y_2 < Y_3 < Y_4 < Y_5$ denote the order statistics of a random sample
of size 5 from a distribution of the continuous type. Compute:




(a) $P(Y_1 < \xi_{0.5} < Y_5)$.
(b) $P(Y_1 < \xi_{0.25} < Y_3)$.
(c) $P(Y_4 < \xi_{0.80} < Y_5)$.


4.4.26. Compute $P(Y_3 < \xi_{0.5} < Y_7)$ if $Y_1 < \cdots < Y_9$ are the order statistics of a
random sample of size 9 from a distribution of the continuous type.


4.4.27. Find the smallest value of $n$ for which $P(Y_1 < \xi_{0.5} < Y_n) \ge 0.99$, where
$Y_1 < \cdots < Y_n$ are the order statistics of a random sample of size $n$ from a distribution of
the continuous type.


4.4.28. Let $Y_1 < Y_2$ denote the order statistics of a random sample of size 2 from
a distribution that is $N(\mu, \sigma^2)$, where $\sigma^2$ is known.


(a) Show that $P(Y_1 < \mu < Y_2) = \frac{1}{2}$ and compute the expected value of the
random length $Y_2 - Y_1$.


(b) If $\bar{X}$ is the mean of this sample, find the constant $c$ that solves the equation
$P(\bar{X} - c\sigma < \mu < \bar{X} + c\sigma) = \frac{1}{2}$, and compare the length of this random interval
with the expected value of that of part (a).


4.4.29. Let $y_1 < y_2 < y_3$ be the observed values of the order statistics of a random
sample of size $n = 3$ from a continuous type distribution. Without knowing these
values, a statistician is given these values in a random order, and she wants to select 
the largest; but once she refuses an observation, she cannot go back. Clearly, if she 
selects the first one, her probability of getting the largest is 1/3. Instead, she decides 
to use the following algorithm: She looks at the first but refuses it and then takes 
the second if it is larger than the first, or else she takes the third. Show that this 
algorithm has probability of 1/2 of selecting the largest. 


4.4.30. Refer to Exercise 4.1.1. Using expression (4.4.10), obtain a confidence 
interval (with confidence close to 90%) for the median lifetime of a motor. What 
does the interval mean? 


4.4.31. Let $Y_1 < Y_2 < \cdots < Y_n$ denote the order statistics of a random sample of
size $n$ from a distribution that has pdf $f(x) = 3x^2/\theta^3$, $0 < x < \theta$, zero elsewhere.


(a) Show that $P(c < Y_n/\theta < 1) = 1 - c^{3n}$, where $0 < c < 1$.


(b) If $n$ is 4 and if the observed value of $Y_4$ is 2.3, what is a 95% confidence interval
for $\theta$?


4.4.32. Reconsider the weight of professional baseball players in the data file bb.rda.
Obtain comparison boxplots of the weights of the hitters and pitchers (use the R
code boxplot(x,y) where x and y contain the weights of the hitters and pitchers,
respectively). Then obtain 95% confidence intervals for the median weights of the
hitters and pitchers (use the R function onesampsgn). Comment.

4.5 Introduction to Hypothesis Testing


Point estimation and confidence intervals are useful statistical inference procedures.
Another type of inference that is frequently used concerns tests of hypotheses. As
in Sections 4.1 through 4.3, suppose our interest centers on a random variable $X$
that has density function $f(x; \theta)$, where $\theta \in \Omega$. Suppose we think, due to theory
or a preliminary experiment, that $\theta \in \omega_0$ or $\theta \in \omega_1$, where $\omega_0$ and $\omega_1$ are disjoint
subsets of $\Omega$ and $\omega_0 \cup \omega_1 = \Omega$. We label these hypotheses as

$$H_0 : \theta \in \omega_0 \text{ versus } H_1 : \theta \in \omega_1. \tag{4.5.1}$$

The hypothesis $H_0$ is referred to as the null hypothesis, while $H_1$ is referred to as
the alternative hypothesis. Often the null hypothesis represents no change or no
difference from the past, while the alternative represents change or difference. The
alternative is often referred to as the research worker’s hypothesis. The decision
rule to take $H_0$ or $H_1$ is based on a sample $X_1, \ldots, X_n$ from the distribution of $X$
and, hence, the decision could be wrong. For instance, we could decide that $\theta \in \omega_1$
when really $\theta \in \omega_0$ or we could decide that $\theta \in \omega_0$ when, in fact, $\theta \in \omega_1$. We label
these errors Type I and Type II errors, respectively, later in this section. As we
show in Chapter 8, a careful analysis of these errors can lead in certain situations
to optimal decision rules. In this section, though, we simply want to introduce the
elements of hypothesis testing. To set ideas, consider the following example.


Example 4.5.1 (Zea mays Data). In 1878 Charles Darwin recorded some data 
on the heights of Zea mays plants to determine what effect cross-fertilization or 
self-fertilization had on the height of Zea mays. The experiment was to select one 
cross-fertilized plant and one self-fertilized plant, grow them in the same pot, and 
then later measure their heights. An interesting hypothesis for this example would 
be that the cross-fertilized plants are generally taller than the self-fertilized plants. 
This is the alternative hypothesis, i.e., the research worker’s hypothesis. The null 
hypothesis is that the plants generally grow to the same height regardless of whether 
they were self- or cross-fertilized. Data for 15 pots were recorded. 

We represent the data as $(Y_1, Z_1), \ldots, (Y_{15}, Z_{15})$, where $Y_i$ and $Z_i$ are the heights
of the cross-fertilized and self-fertilized plants, respectively, in the ith pot. Let
$X_i = Y_i - Z_i$. Due to growing in the same pot, $Y_i$ and $Z_i$ may be dependent random
variables, but it seems appropriate to assume independence between pots, i.e.,
independence between the paired random vectors. So we assume that $X_1, \ldots, X_{15}$
form a random sample. As a tentative model, consider the location model

$$X_i = \mu + e_i, \quad i = 1, \ldots, 15,$$

where the random variables $e_i$ are iid with continuous density $f(x)$. For this model,
there is no loss in generality in assuming that the mean of $e_i$ is 0, for, otherwise, we
can simply redefine $\mu$. Hence, $E(X_i) = \mu$. Further, the density of $X_i$ is $f_X(x; \mu) =
f(x - \mu)$. In practice, the goodness of the model is always a concern and diagnostics
based on the data would be run to confirm the quality of the model.

If $\mu = E(X_i) = 0$, then $E(Y_i) = E(Z_i)$; i.e., on average, the cross-fertilized
plants grow to the same height as the self-fertilized plants. While, if $\mu > 0$ then
$E(Y_i) > E(Z_i)$; i.e., on average the cross-fertilized plants are taller than the
self-fertilized plants. Under this model, our hypotheses are

$$H_0 : \mu = 0 \text{ versus } H_1 : \mu > 0. \tag{4.5.2}$$

Hence, $\omega_0 = \{0\}$ represents no difference in the treatments, while $\omega_1 = (0, \infty)$
represents that the mean height of cross-fertilized Zea mays exceeds the mean height
of self-fertilized Zea mays. ■


Table 4.5.1: 2 x 2 Decision Table for a Hypothesis Test

                        True State of Nature
Decision           H_0 is True          H_1 is True
Reject H_0         Type I Error         Correct Decision
Accept H_0         Correct Decision     Type II Error


To complete the testing structure for the general problem described at the beginning
of this section, we need to discuss decision rules. Recall that $X_1, \ldots, X_n$
is a random sample from the distribution of a random variable $X$ that has density
$f(x; \theta)$, where $\theta \in \Omega$. Consider testing the hypotheses $H_0 : \theta \in \omega_0$ versus
$H_1 : \theta \in \omega_1$, where $\omega_0 \cup \omega_1 = \Omega$. Denote the space of the sample by $\mathcal{D}$; that is,
$\mathcal{D} = \text{space}\,\{(X_1, \ldots, X_n)\}$. A test of $H_0$ versus $H_1$ is based on a subset $C$ of $\mathcal{D}$.
This set $C$ is called the critical region and its corresponding decision rule (test)
is

$$\begin{array}{ll} \text{Reject } H_0 \text{ (Accept } H_1) & \text{if } (X_1, \ldots, X_n) \in C \\ \text{Retain } H_0 \text{ (Reject } H_1) & \text{if } (X_1, \ldots, X_n) \in C^c. \end{array} \tag{4.5.3}$$


For a given critical region, the 2 x 2 decision table, as shown in Table 4.5.1,
summarizes the results of the hypothesis test in terms of the true state of nature.
Besides the correct decisions, two errors can occur. A Type I error occurs if $H_0$ is
rejected when it is true, while a Type II error occurs if $H_0$ is accepted when $H_1$ is
true.

The goal, of course, is to select a critical region from all possible critical regions
which minimizes the probabilities of these errors. In general, this is not possible.
The probabilities of these errors often have a seesaw effect. This can be seen
immediately in an extreme case. Simply let $C = \emptyset$. With this critical region, we would
never reject $H_0$, so the probability of Type I error would be 0, but the probability of
Type II error is 1. Often we consider Type I error to be the worse of the two errors.
We then proceed by selecting critical regions that bound the probability of Type I
error and then among these critical regions we try to select one that minimizes the
probability of Type II error.
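The seesaw effect can be made concrete with a small binomial illustration. The setting
in the sketch below is assumed for illustration only: $S$ is binomial with $n = 20$, the
null value is $p_0 = 0.7$, a single alternative $p_1 = 0.5$ is considered, and the critical
region is $\{S \le k\}$ for several values of $k$.

```r
n <- 20; p0 <- 0.7; p1 <- 0.5     # assumed illustrative values
k <- 8:14                          # candidate cutoffs for the region {S <= k}

type1 <- pbinom(k, n, p0)          # P(reject H0) when p = p0
type2 <- 1 - pbinom(k, n, p1)      # P(do not reject H0) when p = p1

# As k increases, the Type I error probability rises and the Type II falls
round(data.frame(k, type1, type2), 4)
```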


Definition 4.5.1. We say a critical region $C$ is of size $\alpha$ if

$$\alpha = \max_{\theta \in \omega_0} P_\theta[(X_1, \ldots, X_n) \in C]. \tag{4.5.4}$$

Over all critical regions of size $\alpha$, we want to consider critical regions that have
lower probabilities of Type II error. We also can look at the complement of a Type
II error, namely, rejecting $H_0$ when $H_1$ is true, which is a correct decision, as marked
in Table 4.5.1. Since this is a correct decision, we want its probability to be as
large as possible. That is, for $\theta \in \omega_1$, we want to maximize

$$1 - P_\theta[\text{Type II Error}] = P_\theta[(X_1, \ldots, X_n) \in C].$$

The probability on the right side of this equation is called the power of the test
at $\theta$. It is the probability that the test detects the alternative $\theta$ when $\theta \in \omega_1$ is
the true parameter. So minimizing the probability of Type II error is equivalent to
maximizing power.

We define the power function of a critical region to be

$$\gamma_C(\theta) = P_\theta[(X_1, \ldots, X_n) \in C]; \quad \theta \in \omega_1. \tag{4.5.5}$$

Hence, given two critical regions $C_1$ and $C_2$, which are both of size $\alpha$, $C_1$ is better
than $C_2$ if $\gamma_{C_1}(\theta) \ge \gamma_{C_2}(\theta)$ for all $\theta \in \omega_1$. In Chapter 8, we obtain optimal critical
regions for specific situations. In this section, we want to illustrate these concepts
of hypothesis testing with several examples.


Example 4.5.2 (Test for a Binomial Proportion of Success). Let $X$ be a Bernoulli
random variable with probability of success $p$. Suppose we want to test, at size $\alpha$,

$$H_0 : p = p_0 \text{ versus } H_1 : p < p_0, \tag{4.5.6}$$

where $p_0$ is specified. As an illustration, suppose “success” is dying from a certain
disease and $p_0$ is the probability of dying with some standard treatment. A new
treatment is used on several (randomly chosen) patients, and it is hoped that the
probability of dying under this new treatment is less than $p_0$. Let $X_1, \ldots, X_n$ be
a random sample from the distribution of $X$ and let $S = \sum_{i=1}^{n} X_i$ be the total
number of successes in the sample. An intuitive decision rule (critical region) is

$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } S \le k, \tag{4.5.7}$$

where $k$ is such that $\alpha = P_{H_0}[S \le k]$. Since $S$ has a $b(n, p_0)$ distribution under $H_0$,
$k$ is determined by $\alpha = P_{p_0}[S \le k]$. Because the binomial distribution is discrete,
however, it is likely that there is no integer $k$ that solves this equation. For example,
suppose $n = 20$, $p_0 = 0.7$, and $\alpha = 0.15$. Then under $H_0$, $S$ has a binomial $b(20, 0.7)$
distribution. Hence, computationally, $P_{H_0}[S \le 11]$ = pbinom(11,20,0.7) = 0.1133
and $P_{H_0}[S \le 12]$ = pbinom(12,20,0.7) = 0.2277. Hence, erring on the conservative
side, we would probably choose $k$ to be 11 and $\alpha = 0.1133$. As $n$ increases, this is
less of a problem; see, also, the later discussion on p-values. In general, the power
of the test for the hypotheses (4.5.6) is

$$\gamma(p) = P_p[S \le k], \quad p < p_0. \tag{4.5.8}$$

The curve labeled Test 1 in Figure 4.5.1 is the power function for the case $n = 20$,
$p_0 = 0.7$, and $\alpha = 0.1133$. Notice that the power function is decreasing. The
power is higher to detect the alternative $p = 0.2$ than $p = 0.6$. In Section 8.2, we
prove in general the monotonicity of the power function for binomial tests of these
hypotheses. Using this monotonicity, we extend our test to the more general null
hypothesis $H_0 : p \ge p_0$ rather than simply $H_0 : p = p_0$. Using the same decision
rule as we used for the hypotheses (4.5.6), the definition of the size of a test (4.5.4),
and the monotonicity of the power curve, we have

$$\max_{p \ge p_0} P_p[S \le k] = P_{p_0}[S \le k] = \alpha,$$

i.e., the same size as for the original null hypothesis.


Figure 4.5.1: Power curves $\gamma(p)$ for Tests 1 and 2 (Test 1 has size $\alpha = 0.113$); see Example 4.5.2. 


Denote by Test 1 the test for the situation with $n = 20$, $p_0 = 0.70$, and size 
$\alpha = 0.1133$. Suppose we have a second test (Test 2) with an increased size. How 
does the power function of Test 2 compare to that of Test 1? As an example, suppose 
for Test 2, we select $\alpha = 0.2277$. Hence, for Test 2, we reject $H_0$ if $S \le 12$. 
Figure 4.5.1 displays the resulting power function. Note that while Test 2 has a 
higher probability of committing a Type I error, it also has a higher power at each 
alternative $p < 0.7$. Exercise 4.5.7 shows that this is true for these binomial tests. 
It is true in general; that is, if the size of the test increases, power does too. For 
this example, the R function binpower.r, found at the site listed in the Preface, 
produces a version of Figure 4.5.1. ■ 
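
The sizes and power curves of Tests 1 and 2 can be reproduced with a few lines of R. The following is only a minimal sketch and is not the book's binpower.r; the helper name binom_power and the grid of alternatives are assumptions made for illustration.

```r
# Power of the test "reject H0 if S <= k" when S ~ b(n, p).
# binom_power is a hypothetical helper name, not part of hmcpkg.
binom_power <- function(p, k, n = 20) pbinom(k, n, p)

p <- seq(0.2, 0.7, by = 0.01)       # grid of values p <= p0 = 0.7 (assumed)
gam1 <- binom_power(p, k = 11)      # Test 1: reject if S <= 11
gam2 <- binom_power(p, k = 12)      # Test 2: reject if S <= 12

size1 <- binom_power(0.7, k = 11)   # size of Test 1 (power at p0)
size2 <- binom_power(0.7, k = 12)   # size of Test 2 (power at p0)

# Overlay the two power curves
plot(p, gam2, type = "l", lty = 2, ylim = c(0, 1), ylab = "power")
lines(p, gam1, lty = 1)
legend("bottomleft", legend = c("Test 1", "Test 2"), lty = 1:2)
```

Evaluating the power at $p_0 = 0.7$ returns the size of each test, which is why the same function serves for both quantities.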


Remark 4.5.1 (Nomenclature). Since in Example 4.5.2, the first null hypothesis 
$H_0: p = p_0$ completely specifies the underlying distribution, it is called a simple 
hypothesis. Most hypotheses, such as $H_1: p < p_0$, are composite hypotheses, 
because they are composed of many simple hypotheses and, hence, do not completely 
specify the distribution. 

As we study more and more statistics, we discover that often other names are 
used for the size, $\alpha$, of the critical region. Frequently, $\alpha$ is also called the 
significance level of the test associated with that critical region. Moreover, sometimes 
$\alpha$ is called the “maximum of probabilities of committing an error of Type I” and 
the “maximum of the power of the test when $H_0$ is true.” It is disconcerting to the 
student to discover that there are so many names for the same thing. However, all 
of them are used in the statistical literature, and we feel obligated to point out this 
fact. 


The test in the last example is based on the exact distribution of its test statistic, 
i.e., the binomial distribution. Often we cannot obtain the distribution of the test 
statistic in closed form. As with approximate confidence intervals, however, we can 
frequently appeal to the Central Limit Theorem to obtain an approximate test; see 
Theorem 4.2.1. Such is the case for the next example. 


Example 4.5.3 (Large Sample Test for the Mean). Let $X$ be a random variable 
with mean $\mu$ and finite variance $\sigma^2$. We want to test the hypotheses 


$$H_0: \mu = \mu_0 \quad\text{versus}\quad H_1: \mu > \mu_0, \qquad (4.5.9)$$


where $\mu_0$ is specified. To illustrate, suppose $\mu_0$ is the mean level on a standardized 
test of students who have been taught a course by a standard method of teaching. 
Suppose it is hoped that a new method that incorporates computers has a mean 
level $\mu > \mu_0$, where $\mu = E(X)$ and $X$ is the score of a student taught by the new 
method. This conjecture is tested by having $n$ students (randomly selected) taught 
under this new method. 

Let $X_1,\ldots,X_n$ be a random sample from the distribution of $X$ and denote the 
sample mean and variance by $\overline{X}$ and $S^2$, respectively. Because $\overline{X}$ is an unbiased 
estimate of $\mu$, an intuitive decision rule is given by 


$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } \overline{X} \text{ is much larger than } \mu_0. \qquad (4.5.10)$$


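In general, the distribution of the sample mean cannot be obtained in closed form. 
In Example 4.5.4, under the strong assumption of normality for the distribution of 
$X$, we obtain an exact test. For now, the Central Limit Theorem (Theorem 4.2.1) 
shows that the distribution of $(\overline{X} - \mu)/(S/\sqrt{n})$ is approximately $N(0,1)$. Using 
this, we obtain a test with an approximate size $\alpha$, with the decision rule 


$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } \frac{\overline{X} - \mu_0}{S/\sqrt{n}} \ge z_\alpha. \qquad (4.5.11)$$


The test is intuitive. To reject $H_0$, $\overline{X}$ must exceed $\mu_0$ by at least $z_\alpha S/\sqrt{n}$. To 
approximate the power function of the test, we use the Central Limit Theorem. 
Upon substituting $\sigma$ for $S$, it readily follows that the approximate power function 
is 


$$\gamma(\mu) = P_\mu\left(\overline{X} \ge \mu_0 + z_\alpha \sigma/\sqrt{n}\right) = P_\mu\left(\frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \ge \frac{\sqrt{n}(\mu_0 - \mu)}{\sigma} + z_\alpha\right)$$

$$\approx 1 - \Phi\left(z_\alpha + \frac{\sqrt{n}(\mu_0 - \mu)}{\sigma}\right) = \Phi\left(-z_\alpha - \frac{\sqrt{n}(\mu_0 - \mu)}{\sigma}\right). \qquad (4.5.12)$$
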
So if we have some reasonable idea of what $\sigma$ equals, we can compute the approximate 
power function. As Exercise 4.5.1 shows, this approximate power function is 
strictly increasing in $\mu$, so as in the last example, we can change the null hypothesis 
to 

$$H_0: \mu \le \mu_0 \quad\text{versus}\quad H_1: \mu > \mu_0. \qquad (4.5.13)$$


Our asymptotic test has approximate size $\alpha$ for these hypotheses. ■ 
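
As a numerical illustration of the approximate power function (4.5.12), the sketch below evaluates it on a grid of means. The helper name z_power_one_sided is hypothetical, and the values $\mu_0 = 30{,}000$, $\sigma = 5000$, $n = 20$, and $\alpha = 0.05$ are assumed purely for illustration.

```r
# Approximate power (4.5.12) for H0: mu = mu0 versus H1: mu > mu0.
# z_power_one_sided is a hypothetical helper; sigma is treated as known.
z_power_one_sided <- function(mu, mu0, sigma, n, alpha = 0.05) {
  za <- qnorm(1 - alpha)                        # upper alpha critical point of N(0,1)
  pnorm(-za - sqrt(n) * (mu0 - mu) / sigma)     # Phi(-z_alpha - sqrt(n)(mu0 - mu)/sigma)
}

mu  <- seq(30000, 36000, by = 500)              # assumed grid of alternatives
gam <- z_power_one_sided(mu, mu0 = 30000, sigma = 5000, n = 20)
round(cbind(mu, gam), 4)
```

At $\mu = \mu_0$ the function returns $\alpha$, and the values increase with $\mu$, in line with Exercise 4.5.1.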


Example 4.5.4 (Test for $\mu$ Under Normality). Let $X$ have a $N(\mu, \sigma^2)$ distribution. 
As in Example 4.5.3, consider the hypotheses 


$$H_0: \mu = \mu_0 \quad\text{versus}\quad H_1: \mu > \mu_0, \qquad (4.5.14)$$


where $\mu_0$ is specified. Assume that the desired size of the test is $\alpha$, for $0 < \alpha < 1$. 
Suppose $X_1,\ldots,X_n$ is a random sample from a $N(\mu, \sigma^2)$ distribution. Let $\overline{X}$ and 
$S^2$ denote the sample mean and variance, respectively. Our intuitive rejection rule 
is to reject $H_0$ in favor of $H_1$ if $\overline{X}$ is much larger than $\mu_0$. Unlike Example 4.5.3, we 
now know the distribution of the statistic $\overline{X}$. In particular, by Part (d) of Theorem 
3.6.1, under $H_0$ the statistic $T = (\overline{X} - \mu_0)/(S/\sqrt{n})$ has a t-distribution with $n - 1$ 
degrees of freedom. Using the distribution of $T$, it follows that this rejection rule 
has exact level $\alpha$: 


$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } T = \frac{\overline{X} - \mu_0}{S/\sqrt{n}} \ge t_{\alpha,n-1}, \qquad (4.5.15)$$


where $t_{\alpha,n-1}$ is the upper $\alpha$ critical point of a t-distribution with $n - 1$ degrees of 
freedom; i.e., $\alpha = P(T > t_{\alpha,n-1})$. This is often called the t-test of $H_0: \mu = \mu_0$. 

Note the differences between this rejection rule and the large sample rule, (4.5.11). 
The large sample rule has approximate level $\alpha$, while this has exact level $\alpha$. Of 
course, we now have to assume that $X$ has a normal distribution. In practice, we 
may not be willing to assume that the population is normal. Usually t-critical values 
are larger than z-critical values; hence, the t-test is conservative relative to the 
large sample test. So, in practice, many statisticians often use the t-test. 

The R code t.test(x,mu=mu0,alt="greater") computes the t-test for the 
hypotheses (4.5.14), where the R vector x contains the sample. ■ 
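
The claim that t-critical values exceed z-critical values, and the agreement between (4.5.15) and t.test, can be checked numerically. This is a sketch under assumptions: the sample sizes, the seed, and the simulated normal sample are all made up for illustration.

```r
# Compare upper-alpha critical values of t(n - 1) and N(0, 1)
alpha <- 0.05
n     <- c(5, 10, 15, 30, 100)          # assumed sample sizes
tcrit <- qt(1 - alpha, df = n - 1)
zcrit <- qnorm(1 - alpha)
cbind(n, tcrit, zcrit)

# One-sided t-test on a simulated (hypothetical) normal sample
set.seed(1)
x   <- rnorm(15, mean = 1, sd = 2)
mu0 <- 0
out <- t.test(x, mu = mu0, alternative = "greater")

# The statistic of (4.5.15) computed by hand
tstat <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
c(by_hand = tstat, from_t.test = unname(out$statistic))
```

The argument alt="greater" used in the text works through R's partial argument matching; the sketch spells out alternative in full.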


Example 4.5.5 (Example 4.5.1, Continued). The data for Darwin’s experiment 
on Zea mays are recorded in Table 4.5.2 and are also in the file darwin.rda. A 
boxplot and a normal q-q plot of the 15 differences, $x_i = y_i - z_i$, are found in 
Figure 4.5.2. Based on these plots, we can see that there seem to be two outliers, 
Pots 2 and 15. In these two pots, the self-fertilized Zea mays are much taller than 
their cross-fertilized pairs. Except for these two outliers, the differences, $y_i - z_i$, are 
positive, indicating that the cross-fertilization leads to taller plants. We proceed 
to conduct a test of hypotheses (4.5.2), as discussed in Example 4.5.4. We use the 
decision rule given by (4.5.15) with $\alpha = 0.05$. As Exercise 4.5.2 shows, the values 
of the sample mean and standard deviation for the differences, $x_i$, are $\overline{x} = 2.62$ 
and $s_x = 4.72$. Hence the t-test statistic is 2.15, which exceeds the t-critical value, 
$t_{.05,14}$ = qt(0.95,14) = 1.76. Thus we reject $H_0$ and conclude that cross-fertilized 
Zea mays are on the average taller than self-fertilized Zea mays. Because of the 
outliers, normality of the error distribution is somewhat dubious, and we use the 
test in a conservative manner, as discussed at the end of Example 4.5.4. 


Table 4.5.2: Plant Growth 

Pot        1       2       3       4       5       6       7       8
Cross   23.500  12.000  21.000  22.000  19.125  21.500  22.125  20.375
Self    17.375  20.375  20.000  20.000  18.375  18.625  18.625  15.250

Pot        9      10      11      12      13      14      15
Cross   18.250  21.625  23.250  21.000  22.125  23.000  12.000
Self    16.500  18.000  16.250  18.000  12.750  15.500  18.000


Assuming that the rda file darwin.rda has been loaded in R, the code for the 
above t-test is t.test(cross-self,mu=0,alt="greater"), which evaluates the 
t-test statistic to be 2.1506. 
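
For readers without darwin.rda at hand, the computation can be sketched from the values printed in Table 4.5.2. The vectors below are typed in from the table as it appears here (an assumption: the file may carry the data at a different precision, so the statistic can differ slightly from the t.test output quoted above).

```r
# Heights typed in from Table 4.5.2 (pots 1 through 15)
cross <- c(23.500, 12.000, 21.000, 22.000, 19.125, 21.500, 22.125, 20.375,
           18.250, 21.625, 23.250, 21.000, 22.125, 23.000, 12.000)
self  <- c(17.375, 20.375, 20.000, 20.000, 18.375, 18.625, 18.625, 15.250,
           16.500, 18.000, 16.250, 18.000, 12.750, 15.500, 18.000)

x     <- cross - self                 # paired differences
n     <- length(x)
xbar  <- mean(x)                      # sample mean of the differences
s     <- sd(x)                        # sample standard deviation
tstat <- xbar / (s / sqrt(n))         # t statistic for H0: mu = 0
tcrit <- qt(0.95, df = n - 1)         # upper 0.05 critical point, 14 df
reject <- tstat >= tcrit              # decision rule (4.5.15)

round(c(xbar = xbar, s = s, t = tstat, tcrit = tcrit), 3)
```

With these tabled values the statistic rounds to 2.15 and exceeds the critical value, so the decision agrees with the example.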
Figure 4.5.2: Boxplot and normal q-q plot for the data of Example 4.5.5. 


EXERCISES 
In many of these exercises, use R or another statistical package for computations 
and graphs of power functions. 


4.5.1. Show that the approximate power function given in expression (4.5.12) of 
Example 4.5.3 is a strictly increasing function of $\mu$. Show then that the test discussed 
in this example has approximate size $\alpha$ for testing 


$$H_0: \mu \le \mu_0 \quad\text{versus}\quad H_1: \mu > \mu_0.$$


4.5.2. For the Darwin data tabled in Example 4.5.5, verify that the Student t-test 
statistic is 2.15. 


4.5.3. Let $X$ have a pdf of the form $f(x;\theta) = \theta x^{\theta-1}$, $0 < x < 1$, zero elsewhere, 
where $\theta \in \{\theta: \theta = 1, 2\}$. To test the simple hypothesis $H_0: \theta = 1$ against the 
alternative simple hypothesis $H_1: \theta = 2$, use a random sample $X_1, X_2$ of size $n = 2$ 
and define the critical region to be $C = \{(x_1, x_2): \tfrac{3}{4} \le x_1 x_2\}$. Find the power 
function of the test. 


4.5.4. Let $X$ have a binomial distribution with the number of trials $n = 10$ and 
with $p$ either 1/4 or 1/2. The simple hypothesis $H_0: p = \tfrac{1}{2}$ is rejected, and the 
alternative simple hypothesis $H_1: p = \tfrac{1}{4}$ is accepted, if the observed value of $X_1$, a 
random sample of size 1, is less than or equal to 3. Find the significance level and 
the power of the test. 


4.5.5. Let $X_1, X_2$ be a random sample of size $n = 2$ from the distribution having 
pdf $f(x;\theta) = (1/\theta)e^{-x/\theta}$, $0 < x < \infty$, zero elsewhere. We reject $H_0: \theta = 2$ and 
accept $H_1: \theta = 1$ if the observed values of $X_1, X_2$, say $x_1, x_2$, are such that 


$$\frac{f(x_1; 2) f(x_2; 2)}{f(x_1; 1) f(x_2; 1)} \le \frac{1}{2}.$$


Here $\Omega = \{\theta: \theta = 1, 2\}$. Find the significance level of the test and the power of the 
test when $H_0$ is false. 


4.5.6. Consider the tests Test 1 and Test 2 for the situation discussed in Example 
4.5.2. Consider the test that rejects $H_0$ if $S \le 10$. Find the level of significance for 
this test and sketch its power curve as in Figure 4.5.1. 


4.5.7. Consider the situation described in Example 4.5.2. Suppose we have two 
tests A and B defined as follows. For Test A, $H_0$ is rejected if $S \le k_A$, while for 
Test B, $H_0$ is rejected if $S \le k_B$. If Test A has a higher level of significance than 
Test B, show that Test A has higher power than Test B at each alternative. 


4.5.8. Let us say the life of a tire in miles, say $X$, is normally distributed with mean 
$\theta$ and standard deviation 5000. Past experience indicates that $\theta = 30{,}000$. The 
manufacturer claims that the tires made by a new process have mean $\theta > 30{,}000$. 
It is possible that $\theta = 35{,}000$. Check his claim by testing $H_0: \theta = 30{,}000$ against 
$H_1: \theta > 30{,}000$. We observe $n$ independent values of $X$, say $x_1,\ldots,x_n$, and we 
reject $H_0$ (thus accept $H_1$) if and only if $\overline{x} \ge c$. Determine $n$ and $c$ so that the power 
function $\gamma(\theta)$ of the test has the values $\gamma(30{,}000) = 0.01$ and $\gamma(35{,}000) = 0.98$. 


4.5.9. Let $X$ have a Poisson distribution with mean $\theta$. Consider the simple 
hypothesis $H_0: \theta = \tfrac{1}{2}$ and the alternative composite hypothesis $H_1: \theta < \tfrac{1}{2}$. Thus 
$\Omega = \{\theta: 0 < \theta \le \tfrac{1}{2}\}$. Let $X_1,\ldots,X_{12}$ denote a random sample of size 12 from this 
distribution. We reject $H_0$ if and only if the observed value of $Y = X_1 + \cdots + X_{12} \le 2$. 
Show that the following R code graphs the power function of this test: 
theta=seq(.1,.5,.05); gam=ppois(2,theta*12) 
plot(gam~theta,pch="_",xlab=expression(theta),ylab=expression(gamma)) 
lines(gam~theta) 
Run the code. Determine the significance level from the plot. 


4.5.10. Let $Y$ have a binomial distribution with parameters $n$ and $p$. We reject 
$H_0: p = \tfrac{1}{2}$ and accept $H_1: p > \tfrac{1}{2}$ if $Y \ge c$. Find $n$ and $c$ to give a power function 
$\gamma(p)$ which is such that $\gamma(\tfrac{1}{2}) = 0.10$ and $\gamma(\tfrac{2}{3}) = 0.95$, approximately. 
4.5.11. Let $Y_1 < Y_2 < Y_3 < Y_4$ be the order statistics of a random sample of size 
$n = 4$ from a distribution with pdf $f(x;\theta) = 1/\theta$, $0 < x < \theta$, zero elsewhere, where 
$0 < \theta$. The hypothesis $H_0: \theta = 1$ is rejected and $H_1: \theta > 1$ is accepted if the 
observed $Y_4 \ge c$. 


(a) Find the constant $c$ so that the significance level is $\alpha = 0.05$. 
(b) Determine the power function of the test. 


4.5.12. Let $X_1, X_2, \ldots, X_8$ be a random sample of size $n = 8$ from a Poisson 
distribution with mean $\mu$. Reject the simple null hypothesis $H_0: \mu = 0.5$ and 
accept $H_1: \mu > 0.5$ if the observed sum $\sum_{i=1}^{8} x_i \ge 8$. 


(a) Show that the significance level is 1-ppois(7,8*.5). 
(b) Use R to determine $\gamma(0.75)$, $\gamma(1)$, and $\gamma(1.25)$. 
(c) Modify the code in Exercise 4.5.9 to obtain a plot of the power function. 


4.5.13. Let $p$ denote the probability that, for a particular tennis player, the first 
serve is good. Since $p = 0.40$, this player decided to take lessons in order to increase 
$p$. When the lessons are completed, the hypothesis $H_0: p = 0.40$ is tested against 
$H_1: p > 0.40$ based on $n = 25$ trials. Let $Y$ equal the number of first serves that 
are good, and let the critical region be defined by $C = \{Y: Y \ge 13\}$. 


(a) Show that $\alpha$ is computed by $\alpha$ = 1-pbinom(12,25,.4). 


(b) Find $\beta = P(Y < 13)$ when $p = 0.60$; that is, $\beta = P(Y \le 12;\, p = 0.60)$ so 
that $1 - \beta$ is the power at $p = 0.60$. 


4.5.14. Let $S$ denote the number of successes in $n = 40$ Bernoulli trials with 
probability of success $p$. Consider the hypotheses $H_0: p \le 0.3$ versus $H_1: p > 0.3$. 
Consider the two tests: (1) Reject $H_0$ if $S \ge 16$ and (2) Reject $H_0$ if $S \ge 17$. 
Determine the level of these tests. The R function binpower.r produces a version 
of Figure 4.5.1. For this exercise, write a similar R function that graphs the power 
functions of the above two tests. 


4.6 Additional Comments About Statistical Tests 


All of the alternative hypotheses considered in Section 4.5 were one-sided 
hypotheses. For illustration, in Exercise 4.5.8 we tested $H_0: \mu = 30{,}000$ against the 
one-sided alternative $H_1: \mu > 30{,}000$, where $\mu$ is the mean of a normal distribution 
having standard deviation $\sigma = 5000$. Perhaps in this situation, though, we think 
the manufacturer’s process has changed but are unsure of the direction. That is, 
we are interested in the alternative $H_1: \mu \ne 30{,}000$. In this section, we further 
explore hypothesis testing, and we begin with the construction of a test for a two-sided 
alternative. 




Example 4.6.1 (Large Sample Two-Sided Test for the Mean). In order to see how 
to construct a test for a two-sided alternative, reconsider Example 4.5.3, where we 
constructed a large sample one-sided test for the mean of a random variable. As 
in Example 4.5.3, let $X$ be a random variable with mean $\mu$ and finite variance $\sigma^2$. 
Here, though, we want to test 


$$H_0: \mu = \mu_0 \quad\text{versus}\quad H_1: \mu \ne \mu_0, \qquad (4.6.1)$$


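where $\mu_0$ is specified. Let $X_1,\ldots,X_n$ be a random sample from the distribution of 
$X$ and denote the sample mean and variance by $\overline{X}$ and $S^2$, respectively. For the 
one-sided test, we rejected $H_0$ if $\overline{X}$ was too large; hence, for the hypotheses (4.6.1), 
we use the decision rule 


$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } \overline{X} \le h \text{ or } \overline{X} \ge k, \qquad (4.6.2)$$


where $h$ and $k$ are such that $\alpha = P_{H_0}[\overline{X} \le h \text{ or } \overline{X} \ge k]$. Clearly, $h < k$; hence, we 
have 

$$\alpha = P_{H_0}[\overline{X} \le h \text{ or } \overline{X} \ge k] = P_{H_0}[\overline{X} \le h] + P_{H_0}[\overline{X} \ge k].$$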


Since, at least for large samples, the distribution of $\overline{X}$ is symmetric 
about $\mu_0$ under $H_0$, an intuitive rule is to divide $\alpha$ equally between the two terms 
on the right side of the above expression; that is, $h$ and $k$ are chosen by 


$$P_{H_0}[\overline{X} \le h] = \alpha/2 \quad\text{and}\quad P_{H_0}[\overline{X} \ge k] = \alpha/2. \qquad (4.6.3)$$


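From Theorem 4.2.1, it follows that, under $H_0$, $(\overline{X} - \mu_0)/(S/\sqrt{n})$ is approximately $N(0, 1)$. 
This and (4.6.3) lead to the approximate decision rule 


$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } \left|\frac{\overline{X} - \mu_0}{S/\sqrt{n}}\right| \ge z_{\alpha/2}. \qquad (4.6.4)$$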


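Upon substituting $\sigma$ for $S$, it readily follows that the approximate power function 
is 


$$\gamma(\mu) = P_\mu\left(\overline{X} \le \mu_0 - z_{\alpha/2}\sigma/\sqrt{n}\right) + P_\mu\left(\overline{X} \ge \mu_0 + z_{\alpha/2}\sigma/\sqrt{n}\right)$$

$$= \Phi\left(\frac{\sqrt{n}(\mu_0 - \mu)}{\sigma} - z_{\alpha/2}\right) + 1 - \Phi\left(\frac{\sqrt{n}(\mu_0 - \mu)}{\sigma} + z_{\alpha/2}\right), \qquad (4.6.5)$$
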


where $\Phi(z)$ is the cdf of a standard normal random variable; see (3.4.9). So if we 
have some reasonable idea of what $\sigma$ equals, we can compute the approximate power 
function. Note that the derivative of the power function is 


$$\gamma'(\mu) = \frac{\sqrt{n}}{\sigma}\left[\phi\left(\frac{\sqrt{n}(\mu_0 - \mu)}{\sigma} + z_{\alpha/2}\right) - \phi\left(\frac{\sqrt{n}(\mu_0 - \mu)}{\sigma} - z_{\alpha/2}\right)\right], \qquad (4.6.6)$$


where $\phi(z)$ is the pdf of a standard normal random variable. Then we can show that 
$\gamma(\mu)$ has a critical value at $\mu_0$ which is the minimum; see Exercise 4.6.2. Further, 
$\gamma(\mu)$ is strictly decreasing for $\mu < \mu_0$ and strictly increasing for $\mu > \mu_0$. 
Consider again the situation at the beginning of this section. Suppose we want 
to test 


$$H_0: \mu = 30{,}000 \quad\text{versus}\quad H_1: \mu \ne 30{,}000. \qquad (4.6.7)$$

Suppose $n = 20$ and $\alpha = 0.01$. Then the rejection rule (4.6.4) becomes 

$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } \left|\frac{\overline{X} - 30{,}000}{S/\sqrt{20}}\right| \ge 2.575. \qquad (4.6.8)$$


Figure 4.6.1 displays the power curve for this test when $\sigma = 5000$ is substituted 
in for $S$. For comparison, the power curve for the test with level $\alpha = 0.05$ is also 
shown. The R function zpower computes a version of this figure. 
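
The two power curves can be sketched directly from the approximate power function of the two-sided large sample test, $\Phi(a - z_{\alpha/2}) + 1 - \Phi(a + z_{\alpha/2})$ with $a = \sqrt{n}(\mu_0 - \mu)/\sigma$. The code below is not the book's zpower function; the helper name z_power_two_sided and the plotting grid are assumptions made for illustration.

```r
# Approximate power of the two-sided large sample test of H0: mu = mu0.
# z_power_two_sided is a hypothetical helper, not the zpower function.
z_power_two_sided <- function(mu, mu0, sigma, n, alpha) {
  z <- qnorm(1 - alpha / 2)                  # z_{alpha/2}
  a <- sqrt(n) * (mu0 - mu) / sigma
  pnorm(a - z) + 1 - pnorm(a + z)
}

mu    <- seq(26000, 34000, by = 100)         # assumed grid of means
gam01 <- z_power_two_sided(mu, 30000, 5000, 20, alpha = 0.01)
gam05 <- z_power_two_sided(mu, 30000, 5000, 20, alpha = 0.05)

plot(mu, gam05, type = "l", lty = 2, ylim = c(0, 1),
     xlab = "mu", ylab = "power")
lines(mu, gam01, lty = 1)
legend("top", legend = c("level 0.05", "level 0.01"), lty = c(2, 1))

mu[which.min(gam01)]                         # location of the minimum on the grid
```

On this grid both curves are bowl shaped, each attains its minimum (equal to its level) at $\mu_0 = 30{,}000$, and the level 0.05 curve lies above the level 0.01 curve.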
Figure 4.6.1: Power curves for the tests of the hypotheses (4.6.7). 


This two-sided test for the mean is approximate. If we assume that $X$ has a 
normal distribution, then, as Exercise 4.6.3 shows, the following test has exact size 
$\alpha$ for testing $H_0: \mu = \mu_0$ versus $H_1: \mu \ne \mu_0$: 


$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } \left|\frac{\overline{X} - \mu_0}{S/\sqrt{n}}\right| \ge t_{\alpha/2,n-1}. \qquad (4.6.9)$$


It too has a bowl-shaped power curve similar to Figure 4.6.1, although it is not as 
easy to show; see Lehmann (1986). 

For computation in R, the code t.test(x,mu=mu0) obtains the two-sided t-test 
of hypotheses (4.6.1), when the R vector x contains the sample. 

There exists a relationship between two-sided tests and confidence intervals. 
Consider the two-sided t-test (4.6.9). Here, we use the rejection rule with “if and 
only if” replacing “if.” Hence, in terms of acceptance, we have 


$$\text{Accept } H_0 \text{ if and only if } \mu_0 - t_{\alpha/2,n-1}S/\sqrt{n} < \overline{X} < \mu_0 + t_{\alpha/2,n-1}S/\sqrt{n}.$$

But this is easily shown to be 


$$\text{Accept } H_0 \text{ if and only if } \mu_0 \in \left(\overline{X} - t_{\alpha/2,n-1}S/\sqrt{n},\; \overline{X} + t_{\alpha/2,n-1}S/\sqrt{n}\right); \qquad (4.6.10)$$

that is, we accept $H_0$ at significance level $\alpha$ if and only if $\mu_0$ is in the $(1 - \alpha)100\%$ 
confidence interval for $\mu$. Equivalently, we reject $H_0$ at significance level $\alpha$ if and 
only if $\mu_0$ is not in the $(1 - \alpha)100\%$ confidence interval for $\mu$. This is true for all 
the two-sided tests and hypotheses discussed in this text. There is also a similar 
relationship between one-sided tests and one-sided confidence intervals. 
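
This duality between the two-sided t-test and the t confidence interval can be checked numerically. The sketch below uses a simulated normal sample and a grid of candidate null values; the seed, sample, and grid are all assumptions chosen for illustration.

```r
# Check: H0: mu = mu0 is rejected at level alpha exactly when mu0 lies
# outside the (1 - alpha)100% t confidence interval.
set.seed(2)
x     <- rnorm(25, mean = 10, sd = 3)        # hypothetical sample
alpha <- 0.05
ci    <- t.test(x, conf.level = 1 - alpha)$conf.int

mu0_grid <- seq(8, 12, by = 0.25)            # candidate null values (assumed)
reject   <- sapply(mu0_grid,
                   function(m) t.test(x, mu = m)$p.value <= alpha)
outside  <- mu0_grid < ci[1] | mu0_grid > ci[2]

data.frame(mu0 = mu0_grid, reject = reject, outside_ci = outside)
```

The two logical columns agree for every candidate $\mu_0$ on the grid, which is the content of (4.6.10).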

Once we recognize this relationship between confidence intervals and tests of 
hypothesis, we can use all those statistics that we used to construct confidence 
intervals to test hypotheses, not only against two-sided alternatives but one-sided 
ones as well. Without listing all of these in a table, we present enough of them so 
that the principle can be understood. 


Example 4.6.2. Let independent random samples be taken from $N(\mu_1, \sigma^2)$ and 
$N(\mu_2, \sigma^2)$, respectively. Say these have the respective sample characteristics $n_1$, 
$\overline{X}$, $S_1^2$ and $n_2$, $\overline{Y}$, $S_2^2$. Let $n = n_1 + n_2$ denote the combined sample size and let 
$S_p^2 = [(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2]/(n - 2)$, (4.2.11), be the pooled estimator of the 
common variance. At $\alpha = 0.05$, reject $H_0: \mu_1 = \mu_2$ and accept the one-sided 
alternative $H_1: \mu_1 > \mu_2$ if 


$$T = \frac{\overline{X} - \overline{Y} - 0}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \ge t_{.05,n-2},$$


because, under $H_0: \mu_1 = \mu_2$, $T$ has a $t(n - 2)$-distribution. A rigorous development 
of this test is given in Example 8.3.1. ■ 
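
The pooled statistic can be computed by hand and compared with R's t.test using var.equal=TRUE. This is a sketch on simulated data; the seed, sample sizes, means, and common standard deviation are assumptions chosen only for illustration.

```r
# Pooled two-sample t-test of H0: mu1 = mu2 versus H1: mu1 > mu2
set.seed(3)
x <- rnorm(12, mean = 5, sd = 2)             # hypothetical first sample
y <- rnorm(10, mean = 4, sd = 2)             # hypothetical second sample

n1 <- length(x); n2 <- length(y); n <- n1 + n2
sp2   <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n - 2)   # pooled variance
tstat <- (mean(x) - mean(y)) / (sqrt(sp2) * sqrt(1 / n1 + 1 / n2))
reject <- tstat >= qt(0.95, df = n - 2)      # rule at alpha = 0.05

# Same test through t.test with the equal-variance assumption
out <- t.test(x, y, alternative = "greater", var.equal = TRUE)
c(by_hand = tstat, from_t.test = unname(out$statistic))
```

Note that t.test defaults to the unequal-variance (Welch) version, so var.equal=TRUE is needed to match the pooled statistic of this example.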


Example 4.6.3. Say $X$ is $b(1, p)$. Consider testing $H_0: p = p_0$ against $H_1: p < p_0$. 
Let $X_1,\ldots,X_n$ be a random sample from the distribution of $X$ and let $\hat{p} = \overline{X}$. To 
test $H_0$ versus $H_1$, we use either 


$$Z_1 = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}} \le c \quad\text{or}\quad Z_2 = \frac{\hat{p} - p_0}{\sqrt{\hat{p}(1 - \hat{p})/n}} \le c.$$


If $n$ is large, both $Z_1$ and $Z_2$ have approximate standard normal distributions 
provided that $H_0: p = p_0$ is true. Hence, if $c$ is set at $-1.645$, then the approximate 
significance level is $\alpha = 0.05$. Some statisticians use $Z_1$ and others $Z_2$. We do 
not have strong preferences one way or the other because the two methods provide 
about the same numerical results. As one might suspect, using $Z_1$ provides better 
probabilities for power calculations if the true $p$ is close to $p_0$, while $Z_2$ is better 
if $H_0$ is clearly false. However, with a two-sided alternative hypothesis, $Z_2$ does 
provide a better relationship with the confidence interval for $p$. That is, $|Z_2| < z_{\alpha/2}$ 
is equivalent to $p_0$ being in the interval from 

$$\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \quad\text{to}\quad \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}},$$

which is the interval that provides a $(1 - \alpha)100\%$ approximate confidence interval 
for $p$ as considered in Section 4.2. ■ 
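
The two statistics and the confidence-interval relationship can be sketched as follows. The helper name prop_z and the data ($y = 42$ successes in $n = 100$ trials, with $p_0 = 0.5$) are hypothetical and serve only as an illustration.

```r
# Z1 uses p0 in the standard error, Z2 uses the estimate phat.
# prop_z is a hypothetical helper name.
prop_z <- function(y, n, p0) {
  phat <- y / n
  c(Z1 = (phat - p0) / sqrt(p0 * (1 - p0) / n),
    Z2 = (phat - p0) / sqrt(phat * (1 - phat) / n))
}

y <- 42; n <- 100; p0 <- 0.5                 # hypothetical data
z <- prop_z(y, n, p0)
reject_lower <- z <= qnorm(0.05)             # one-sided rule with c = -1.645

# Two-sided link between Z2 and the approximate confidence interval for p
phat   <- y / n
ci     <- phat + c(-1, 1) * qnorm(0.975) * sqrt(phat * (1 - phat) / n)
inside <- (p0 > ci[1]) && (p0 < ci[2])
accept <- abs(z[["Z2"]]) < qnorm(0.975)
c(inside = inside, accept = accept)
```

For these made-up data the two statistics are close to each other, and the acceptance decision based on $|Z_2|$ coincides with $p_0$ lying inside the interval.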


In closing this section, we introduce the concept of randomized tests. 
Example 4.6.4. Let $X_1, X_2, \ldots, X_{10}$ be a random sample of size $n = 10$ from a 
Poisson distribution with mean $\theta$. A critical region for testing $H_0: \theta = 0.1$ against 
$H_1: \theta > 0.1$ is given by $Y = \sum_{i=1}^{10} X_i \ge 3$. The statistic $Y$ has a Poisson distribution 
with mean $10\theta$. Thus, with $\theta = 0.1$ so that the mean of $Y$ is 1, the significance level 
of the test is 


$$P(Y \ge 3) = 1 - P(Y \le 2) = 1 - \text{ppois(2,1)} = 1 - 0.920 = 0.080.$$


If, on the other hand, the critical region defined by $\sum_{i=1}^{10} x_i \ge 4$ is used, the 
significance level is 


$$\alpha = P(Y \ge 4) = 1 - P(Y \le 3) = 1 - \text{ppois(3,1)} = 1 - 0.981 = 0.019.$$


For instance, if a significance level of about $\alpha = 0.05$, say, is desired, most 
statisticians would use one of these tests; that is, they would adjust the significance level 
to that of one of these convenient tests. However, a significance level of $\alpha = 0.05$ 
can be achieved in the following way. Let $W$ have a Bernoulli distribution with 
probability of success equal to 


$$P(W = 1) = \frac{0.050 - 0.019}{0.080 - 0.019} = \frac{31}{61}.$$


Assume that $W$ is selected independently of the sample. Consider the rejection rule 

$$\text{Reject } H_0 \text{ if } \sum_{i=1}^{10} x_i \ge 4 \text{ or if } \sum_{i=1}^{10} x_i = 3 \text{ and } W = 1.$$


The significance level of this rule is 


$$P_{H_0}(Y \ge 4) + P_{H_0}(\{Y = 3\} \cap \{W = 1\}) = P_{H_0}(Y \ge 4) + P_{H_0}(Y = 3)P(W = 1)$$

$$= 0.019 + 0.061\left(\frac{31}{61}\right) = 0.05;$$


hence, the decision rule has exactly level 0.05. The process of performing the 
auxiliary experiment to decide whether to reject or not when $Y = 3$ is sometimes referred 
to as a randomized test. ■ 
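
The randomization probability and the resulting level can be recomputed in R. This is a sketch; the variable names are arbitrary, and because it uses unrounded Poisson probabilities, the value of $P(W = 1)$ it produces differs slightly from the rounded 31/61 above.

```r
# Randomized test for H0: theta = 0.1 with n = 10, so Y ~ Poisson(1) under H0
lambda0 <- 10 * 0.1
a3 <- 1 - ppois(2, lambda0)       # level of the rule Y >= 3
a4 <- 1 - ppois(3, lambda0)       # level of the rule Y >= 4
alpha <- 0.05                     # desired level

# Probability of rejecting when Y = 3 so that the overall level is alpha
pw <- (alpha - a4) / (a3 - a4)

# Level of: reject if Y >= 4, or if Y = 3 and W = 1
level <- a4 + dpois(3, lambda0) * pw
round(c(a3 = a3, a4 = a4, pw = pw, level = level), 4)
```

The computed level equals 0.05 up to floating-point error, since $P(Y = 3)$ is exactly the difference of the two one-sided levels.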


4.6.1 Observed Significance Level, p-value 


Not many statisticians like randomized tests in practice, because the use of them 
means that two statisticians could make the same assumptions, observe the same 
data, apply the same test, and yet make different decisions. Hence, they usually 
adjust their significance level so as not to randomize. As a matter of fact, many 
statisticians report what are commonly called observed significance levels or 
p-values (for probability values). 

A general example suffices to explain observed significance levels. Let $X_1,\ldots,X_n$ 
be a random sample from a $N(\mu, \sigma^2)$ distribution, where both $\mu$ and $\sigma^2$ are unknown. 
Consider, first, the one-sided hypotheses $H_0: \mu = \mu_0$ versus $H_1: \mu > \mu_0$, where $\mu_0$ 
is specified. Write the rejection rule as 


$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } \overline{X} \ge k, \qquad (4.6.11)$$


where $\overline{X}$ is the sample mean. Previously we have specified a level and then solved 
for $k$. In practice, though, the level is not specified. Instead, once the sample 
is observed, the realized value $\overline{x}$ of $\overline{X}$ is computed and we ask the question: Is $\overline{x}$ 
sufficiently large to reject $H_0$ in favor of $H_1$? To answer this we calculate the p-value, 
which is the probability 

$$\text{p-value} = P_{H_0}(\overline{X} \ge \overline{x}). \qquad (4.6.12)$$


Note that this is a data-based “significance level” and we call it the observed 
significance level or the p-value. The hypothesis $H_0$ is rejected at all levels greater 
than or equal to the p-value. For example, if the p-value is 0.048, and the nominal 
$\alpha$ level is 0.05, then $H_0$ would be rejected; however, if the nominal $\alpha$ level is 0.01, 
then $H_0$ would not be rejected. In summary, the experimenter sets the hypotheses; 
the statistician selects the test statistic and rejection rule; the data are observed 
and the statistician reports the p-value to the experimenter; and the experimenter 
decides whether the p-value is sufficiently small to warrant rejection of $H_0$ in favor 
of $H_1$. The following example provides a numerical illustration. 


Example 4.6.5. Recall the Darwin data discussed in Example 4.5.5. It was a 
paired design on the heights of cross and self-fertilized Zea mays plants. In each of 
15 pots, one cross-fertilized and one self-fertilized plant were grown. The data of interest 
are the 15 paired differences, (cross − self). As in Example 4.5.5, let $X_i$ denote the 
paired difference for the $i$th pot. Let $\mu$ be the true mean difference. The hypotheses 
of interest are $H_0: \mu = 0$ versus $H_1: \mu > 0$. The standardized rejection rule is 


$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } T \ge k,$$


where $T = \overline{X}/(S/\sqrt{15})$, and $\overline{X}$ and $S$ are respectively the sample mean and 
standard deviation of the differences. The alternative hypothesis states that on the 
average cross-fertilized plants are taller than self-fertilized plants. From Example 
4.5.5 the t-test statistic has the value 2.15. Letting $t(14)$ denote a random variable 
with the t-distribution with 14 degrees of freedom, and using R, the p-value for the 
experiment is 


$$P[t(14) > 2.15] = 1 - \text{pt(2.15,14)} = 1 - 0.9752 = 0.0248. \qquad (4.6.13)$$


In practice, with this p-value, $H_0$ would be rejected at all levels greater than or 
equal to 0.0248. This observed significance level is also part of the output from the 
R call t.test(cross-self,mu=0,alt="greater"). 
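
The p-value in (4.6.13) and the resulting decisions at two nominal levels can be reproduced as follows; this is a minimal sketch, with variable names chosen for illustration.

```r
# One-sided p-value for the observed t statistic 2.15 with 14 degrees of freedom
tobs <- 2.15
pval <- 1 - pt(tobs, df = 14)
# Equivalent call that uses the upper tail directly
pval_upper <- pt(tobs, df = 14, lower.tail = FALSE)

reject_05 <- pval <= 0.05      # decision at nominal level 0.05
reject_01 <- pval <= 0.01      # decision at nominal level 0.01
round(pval, 4)
```

The hypothesis $H_0$ is rejected at the 0.05 level but not at the 0.01 level, which illustrates why reporting the p-value is more informative than reporting a single decision.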


Returning to the discussion above, suppose the hypotheses are $H_0: \mu = \mu_0$ 
versus $H_1: \mu < \mu_0$. Obviously, the observed significance level in this case is 
p-value = $P_{H_0}(\overline{X} \le \overline{x})$. For the two-sided hypotheses $H_0: \mu = \mu_0$ versus $H_1: \mu \ne \mu_0$, 
our “unspecified” rejection rule is 


$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } \overline{X} \le l \text{ or } \overline{X} \ge k. \qquad (4.6.14)$$

For the p-value, compute each of the one-sided p-values, take the smaller p-value, and 
double it. For an illustration, in the Darwin example, suppose the hypotheses 
are $H_0: \mu = 0$ versus $H_1: \mu \ne 0$. Then the p-value is 2(0.0248) = 0.0496. As 
a final note on p-values for two-sided hypotheses, suppose the test statistic can 
be expressed in terms of a t-test statistic. In this case the p-value can be found 
equivalently as follows. If $d$ is the realized value of the t-test statistic, then the 
p-value is 

$$\text{p-value} = P_{H_0}[|t| \ge |d|], \qquad (4.6.15)$$


where, under $H_0$, $t$ has a t-distribution with $n - 1$ degrees of freedom. 
In this discussion on p-values, keep in mind that good science dictates that the 
hypotheses should be known before the data are drawn. 
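
The equivalence of the "double the smaller one-sided p-value" rule and expression (4.6.15) can be sketched for the Darwin statistic; the variable names below are illustrative assumptions.

```r
# Two-sided p-value for a realized t statistic d with df degrees of freedom
d  <- 2.15
df <- 14

p_upper <- 1 - pt(d, df)                 # one-sided p-value for H1: mu > mu0
p_lower <- pt(d, df)                     # one-sided p-value for H1: mu < mu0
p_two   <- 2 * min(p_upper, p_lower)     # double the smaller one

# Expression (4.6.15): P(|t| >= |d|), using the symmetry of the t-distribution
p_abs <- 2 * (1 - pt(abs(d), df))
round(c(p_two = p_two, p_abs = p_abs), 4)
```

The two computations agree because the t-distribution is symmetric about 0.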


EXERCISES 


4.6.1. The R function zpower, found at the site listed in the Preface, computes 
the plot in Figure 4.6.1. Consider the two-sided test for proportions discussed in 
Example 4.6.3 based on the test statistic $Z_1$. Specifically consider the hypotheses 
$H_0: p = 0.6$ versus $H_1: p \ne 0.6$. Using the sample size $n = 50$ and the level 
$\alpha = 0.05$, write an R program, similar to zpower, which computes a plot of the 
power curve for this test on a proportion. 


4.6.2. Consider the power function $\gamma(\mu)$ and its derivative $\gamma'(\mu)$ given by (4.6.5) 
and (4.6.6). Show that $\gamma'(\mu)$ is strictly negative for $\mu < \mu_0$ and strictly positive for 
$\mu > \mu_0$. 


4.6.3. Show that the test defined by (4.6.9) has exact size $\alpha$ for testing $H_0: \mu = \mu_0$ 
versus $H_1: \mu \ne \mu_0$. 


4.6.4. Consider the one-sided t-test for $H_0: \mu = \mu_0$ versus $H_1: \mu > \mu_0$ 
constructed in Example 4.5.4 and the two-sided t-test for $H_0: \mu = \mu_0$ versus 
$H_1: \mu \ne \mu_0$ given in (4.6.9). Assume that both tests are of size $\alpha$. Show that for 
$\mu > \mu_0$, the power function of the one-sided test is larger than the power function 
of the two-sided test. 


4.6.5. On page 373 Rasmussen (1992) discussed a paired design. A baseball coach 
paired 20 members of his team by their speed; i.e., each member of the pair has 
about the same speed. Then for each pair, he randomly chose one member of the 
pair and told him that if he could beat his best time in circling the bases he would 
give him an award (call this response the time of the “self” member). For the other 
member of the pair the coach’s instruction was an award if he could beat the time 
of the other member of the pair (call this response the time of the “rival” member). 
Each member of the pair knew who his rival was. The data are given below, but are 
also in the file selfrival.rda. Let $\mu_d$ be the true difference in times (rival minus 
self) for a pair. The hypotheses of interest are $H_0: \mu_d = 0$ versus $H_1: \mu_d < 0$. The 
data are in order by pairs, so do not mix the order. 


self:   16.20 16.78 17.38 17.59 17.37 17.49 18.18 18.16 18.36 18.53 
        15.92 16.58 17.57 16.75 17.28 17.32 17.51 17.58 18.26 17.87 


rival:  15.95 16.15 17.05 16.99 17.34 17.53 17.34 17.51 18.10 18.19 
        16.04 16.80 17.24 16.81 17.11 17.22 17.33 17.82 18.19 17.88 


(a) Obtain comparison boxplots of the data. Comment on the comparison plots. 
Are there any outliers? 


(b) Compute the paired t-test and obtain the p-value. Are the data significant at 
the 5% level of significance? 


(c) Obtain a point estimate of $\mu_d$ and a 95% confidence interval for it. 


(d) Conclude in terms of the problem. 


4.6.6. Verzani (2014), page 323, presented a data set concerning the effect that 
different dosages of the drug AZT have on patients with HIV. The responses we 
consider are the p24 antigen levels of HIV patients after their treatment with AZT. 
Of the 20 HIV patients in the study, 10 were randomly assigned the dosage of 300 mg 
of AZT while the other 10 were assigned 600 mg. The hypotheses of interest are 
$H_0: \Delta = 0$ versus $H_1: \Delta \ne 0$, where $\Delta = \mu_{600} - \mu_{300}$ and $\mu_{600}$ and $\mu_{300}$ are the true 
mean p24 antigen levels under dosages of 600 mg and 300 mg of AZT, respectively. 
The data are given below but are also available in the file aztdoses.rda. 


300 mg | 284 279 289 292 287 295 285 279 306 298 
600 mg | 298 307 297 279 291 335 299 300 306 291 


(a) Obtain comparison boxplots of the data. Identify outliers by patient. Com- 
ment on the comparison plots. 


(b) Compute the two-sample t-test and obtain the p-value. Are the data 
significant at the 5% level of significance? 


(c) Obtain a point estimate of $\Delta$ and a 95% confidence interval for it. 


(d) Conclude in terms of the problem. 


4.6.7. Among the data collected for the World Health Organization air quality 
monitoring project is a measure of suspended particles in $\mu g/m^3$. Let $X$ and $Y$ equal 
the concentration of suspended particles in $\mu g/m^3$ in the city center (commercial 
district) for Melbourne and Houston, respectively. Using $n = 13$ observations of $X$ 
and $m = 16$ observations of $Y$, we test $H_0: \mu_X = \mu_Y$ against $H_1: \mu_X < \mu_Y$. 


(a) Define the test statistic and critical region, assuming that the unknown 
variances are equal. Let $\alpha = 0.05$. 


(b) If $\overline{x} = 72.9$, $s_x = 25.6$, $\overline{y} = 81.7$, and $s_y = 28.3$, calculate the value of the 
test statistic and state your conclusion. 
4.6.8. Let $p$ equal the proportion of drivers who use a seat belt in a country that 
does not have a mandatory seat belt law. It was claimed that $p = 0.14$. An 
advertising campaign was conducted to increase this proportion. Two months after 
the campaign, $y = 104$ out of a random sample of $n = 590$ drivers were wearing 
their seat belts. Was the campaign successful? 


(a) Define the null and alternative hypotheses. 
(b) Define a critical region with an a = 0.01 significance level. 


(c) Determine the approximate p-value and state your conclusion. 


4.6.9. In Exercise 4.2.18 we found a confidence interval for the variance $\sigma^2$ using 
the variance $S^2$ of a random sample of size $n$ arising from $N(\mu, \sigma^2)$, where the mean 
$\mu$ is unknown. In testing $H_0: \sigma^2 = \sigma_0^2$ against $H_1: \sigma^2 > \sigma_0^2$, use the critical region 
defined by $(n - 1)S^2/\sigma_0^2 \ge c$. That is, reject $H_0$ and accept $H_1$ if $S^2 \ge c\sigma_0^2/(n - 1)$. 
If $n = 13$ and the significance level $\alpha = 0.025$, determine $c$. 


4.6.10. In Exercise 4.2.27, in finding a confidence interval for the ratio of the 
variances of two normal distributions, we used a statistic S?/S3, which has an F- 
distribution when those two variances are equal. If we denote that statistic by F, 
we can test Ho : 07 = 03% against H, : 07 > o3 using the critical region F > c. If 


n = 13, m= 11, and a = 0.05, find c. 


4.7 Chi-Square Tests 


In this section we introduce tests of statistical hypotheses called chi-square tests. 
A test of this sort was originally proposed by Karl Pearson in 1900, and it provided 
one of the earlier methods of statistical inference. 

Let the random variable $X_i$ be $N(\mu_i, \sigma_i^2)$, $i = 1, 2, \ldots, n$, and let $X_1, X_2, \ldots, X_n$
be mutually independent. Thus the joint pdf of these variables is

$$\frac{1}{\sigma_1 \sigma_2 \cdots \sigma_n (2\pi)^{n/2}} \exp\left[ -\frac{1}{2} \sum_{i=1}^{n} \left( \frac{x_i - \mu_i}{\sigma_i} \right)^2 \right], \quad -\infty < x_i < \infty.$$

The random variable that is defined by the exponent (apart from the coefficient
$-\frac{1}{2}$) is $\sum_{i=1}^{n} [(X_i - \mu_i)/\sigma_i]^2$, and this random variable has a $\chi^2(n)$ distribution. In
Section 3.5 we generalized this joint normal distribution of probability to $n$ random
variables that are dependent and we called the distribution a multivariate normal
distribution. Theorem 3.5.1 shows that a similar result holds for the exponent in the
multivariate normal case, also.

Let us now discuss some random variables that have approximate chi-square
distributions. Let $X_1$ be $b(n, p_1)$. Consider the random variable

$$Y = \frac{X_1 - np_1}{\sqrt{np_1(1 - p_1)}},$$

which has, as $n \to \infty$, an approximate $N(0,1)$ distribution (see Theorem 4.2.1).
Furthermore, as discussed in Example 5.3.6, the distribution of $Y^2$ is approximately
$\chi^2(1)$. Let $X_2 = n - X_1$ and let $p_2 = 1 - p_1$. Let $Q_1 = Y^2$. Then $Q_1$ may be
written as

$$Q_1 = \frac{(X_1 - np_1)^2}{np_1(1 - p_1)} = \frac{(X_1 - np_1)^2}{np_1} + \frac{(X_1 - np_1)^2}{n(1 - p_1)}
= \frac{(X_1 - np_1)^2}{np_1} + \frac{(X_2 - np_2)^2}{np_2} \qquad (4.7.1)$$

because $(X_1 - np_1)^2 = (n - X_2 - n + np_2)^2 = (X_2 - np_2)^2$. This result can be
generalized as follows.

Let $X_1, X_2, \ldots, X_{k-1}$ have a multinomial distribution with the parameters $n$
and $p_1, \ldots, p_{k-1}$, as in Section 3.1. Let $X_k = n - (X_1 + \cdots + X_{k-1})$ and let
$p_k = 1 - (p_1 + \cdots + p_{k-1})$. Define $Q_{k-1}$ by

$$Q_{k-1} = \sum_{i=1}^{k} \frac{(X_i - np_i)^2}{np_i}.$$

It is proved in a more advanced course that, as $n \to \infty$, $Q_{k-1}$ has an approximate
$\chi^2(k - 1)$ distribution. Some writers caution the user of this approximation to be
certain that $n$ is large enough so that each $np_i$, $i = 1, 2, \ldots, k$, is at least equal
to 5. In any case, it is important to realize that $Q_{k-1}$ does not have a chi-square
distribution, only an approximate chi-square distribution.
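The quality of this approximation can be examined by simulation. The following R sketch is only an illustration under assumptions of our own choosing ($k = 6$ equally likely cells, $n = 60$ trials, 10,000 replications, and an arbitrary seed); it estimates how often $Q_{k-1}$ exceeds the nominal 5% critical value of a $\chi^2(k-1)$ distribution.

```r
# Sketch: empirical check of the chi-square approximation for Q_{k-1}.
# Assumed setup (ours): k = 6 equally likely cells, n = 60 trials per experiment.
set.seed(20240101)                 # arbitrary seed, for reproducibility only
k <- 6; n <- 60; p <- rep(1/k, k)
nsims <- 10000

# Each column of counts is one multinomial sample (X_1, ..., X_k).
counts <- rmultinom(nsims, size = n, prob = p)

# Q_{k-1} = sum_i (X_i - n p_i)^2 / (n p_i), computed column by column.
q <- colSums((counts - n * p)^2 / (n * p))

crit <- qchisq(0.95, df = k - 1)   # nominal 5% critical value
reject_rate <- mean(q >= crit)     # empirical rejection rate; should be near 0.05
reject_rate
```

If the approximation is adequate for this $n$, the empirical rejection rate should be close to, though not exactly, 0.05.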

The random variable $Q_{k-1}$ may serve as the basis of the tests of certain statistical
hypotheses which we now discuss. Let the sample space $\mathcal{A}$ of a random experiment
be the union of a finite number $k$ of mutually disjoint sets $A_1, A_2, \ldots, A_k$.
Furthermore, let $P(A_i) = p_i$, $i = 1, 2, \ldots, k$, where $p_k = 1 - p_1 - \cdots - p_{k-1}$,
so that $p_i$ is the probability that the outcome of the random experiment is an
element of the set $A_i$. The random experiment is to be repeated $n$ independent
times and $X_i$ represents the number of times the outcome is an element
of set $A_i$. That is, $X_1, X_2, \ldots, X_k = n - X_1 - \cdots - X_{k-1}$ are the frequencies
with which the outcome is, respectively, an element of $A_1, A_2, \ldots, A_k$. Then
the joint pmf of $X_1, X_2, \ldots, X_{k-1}$ is the multinomial pmf with the parameters
$n, p_1, \ldots, p_{k-1}$. Consider the simple hypothesis (concerning this multinomial pmf)
$H_0: p_1 = p_{10}, p_2 = p_{20}, \ldots, p_{k-1} = p_{k-1,0}$ ($p_k = p_{k0} = 1 - p_{10} - \cdots - p_{k-1,0}$),
where $p_{10}, \ldots, p_{k-1,0}$ are specified numbers. It is desired to test $H_0$ against all
alternatives.

If the hypothesis $H_0$ is true, the random variable

$$Q_{k-1} = \sum_{i=1}^{k} \frac{(X_i - np_{i0})^2}{np_{i0}}$$

has an approximate chi-square distribution with $k - 1$ degrees of freedom. Since,
when $H_0$ is true, $np_{i0}$ is the expected value of $X_i$, one would feel intuitively that
observed values of $Q_{k-1}$ should not be too large if $H_0$ is true. Our test is then
to reject $H_0$ if $Q_{k-1} \geq c$. To determine a test with level of significance $\alpha$, we
can use tables of the $\chi^2$-distribution or a computer package. Using R, we compute
the critical value $c$ by qchisq(1 - alpha, k-1). If, then, the hypothesis $H_0$ is rejected
when the observed value of $Q_{k-1}$ is at least as great as $c$, the test of $H_0$ has a
significance level that is approximately equal to $\alpha$. Also, if $q$ is the realized value of
the test statistic $Q_{k-1}$, then the observed significance level of the test is computed
in R by 1-pchisq(q,k-1). This is frequently called a goodness-of-fit test. Some
illustrative examples follow.


Example 4.7.1. One of the first six positive integers is to be chosen by a random
experiment (perhaps by the cast of a die). Let $A_i = \{x : x = i\}$, $i = 1, 2, \ldots, 6$. The
hypothesis $H_0: P(A_i) = p_{i0} = \frac{1}{6}$, $i = 1, 2, \ldots, 6$, is tested, at the approximate 5%
significance level, against all alternatives. To make the test, the random experiment
is repeated under the same conditions, 60 independent times. In this example, $k = 6$
and $np_{i0} = 60(\frac{1}{6}) = 10$, $i = 1, 2, \ldots, 6$. Let $X_i$ denote the frequency with which
the random experiment terminates with the outcome in $A_i$, $i = 1, 2, \ldots, 6$, and let
$Q_5 = \sum_{i=1}^{6} (X_i - 10)^2/10$. Since there are $6 - 1 = 5$ degrees of freedom, the critical
value for a level $\alpha = 0.05$ test is qchisq(0.95,5) = 11.0705. Now suppose that
the experimental frequencies of $A_1, A_2, \ldots, A_6$ are, respectively, 13, 19, 11, 8, 5, and
4. The observed value of $Q_5$ is

$$\frac{(13-10)^2}{10} + \frac{(19-10)^2}{10} + \frac{(11-10)^2}{10} + \frac{(8-10)^2}{10} + \frac{(5-10)^2}{10} + \frac{(4-10)^2}{10} = 15.6.$$

Since $15.6 > 11.0705$, the hypothesis $P(A_i) = \frac{1}{6}$, $i = 1, 2, \ldots, 6$, is rejected at the
(approximate) 5% significance level.
The following R segment computes this test, returning the test statistic and the
p-value as shown:
ps=rep(1/6,6); x=c(13,19,11,8,5,4); chisq.test(x,p=ps)
X-squared = 15.6, df = 5, p-value = 0.008084.
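As a check on Example 4.7.1, the statistic, critical value, and p-value can also be computed directly from the definitions rather than through chisq.test. The sketch below uses only the frequencies and hypothesized probabilities given in the example; the variable names are ours.

```r
# Hand computation of the goodness-of-fit test in Example 4.7.1.
x  <- c(13, 19, 11, 8, 5, 4)       # observed frequencies from the example
p0 <- rep(1/6, 6)                  # hypothesized cell probabilities
n  <- sum(x)                       # 60 independent trials
k  <- length(x)

q    <- sum((x - n * p0)^2 / (n * p0))   # realized value of Q_5
crit <- qchisq(0.95, k - 1)              # critical value for alpha = 0.05
pval <- 1 - pchisq(q, k - 1)             # observed significance level

c(q = q, crit = crit, pval = pval)
```

These values should agree with the 15.6, 11.0705, and 0.008084 reported in the example.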


Example 4.7.2. A point is to be selected from the unit interval $\{x : 0 < x < 1\}$
by a random process. Let $A_1 = \{x : 0 < x \leq \frac{1}{4}\}$, $A_2 = \{x : \frac{1}{4} < x \leq \frac{1}{2}\}$, $A_3 =
\{x : \frac{1}{2} < x \leq \frac{3}{4}\}$, and $A_4 = \{x : \frac{3}{4} < x < 1\}$. Let the probabilities $p_i$, $i = 1, 2, 3, 4$,
assigned to these sets under the hypothesis be determined by the pdf $2x$, $0 < x < 1$,
zero elsewhere. Then these probabilities are, respectively,

$$p_{10} = \int_0^{1/4} 2x \, dx = \frac{1}{16}, \quad p_{20} = \frac{3}{16}, \quad p_{30} = \frac{5}{16}, \quad p_{40} = \frac{7}{16}.$$

Thus the hypothesis to be tested is that $p_1, p_2, p_3$, and $p_4 = 1 - p_1 - p_2 - p_3$ have
the preceding values in a multinomial distribution with $k = 4$. This hypothesis is
to be tested at an approximate 0.025 significance level by repeating the random
experiment $n = 80$ independent times under the same conditions. Here the $np_{i0}$ for
$i = 1, 2, 3, 4$, are, respectively, 5, 15, 25, and 35. Suppose the observed frequencies
of $A_1, A_2, A_3$, and $A_4$ are 6, 18, 20, and 36, respectively. Then the observed value
of $Q_3 = \sum_{i=1}^{4} (X_i - np_{i0})^2/(np_{i0})$ is

$$\frac{(6-5)^2}{5} + \frac{(18-15)^2}{15} + \frac{(20-25)^2}{25} + \frac{(36-35)^2}{35} = \frac{64}{35} = 1.83.$$

The following R segment calculates the test and p-value:
x=c(6,18,20,36); ps=c(1,3,5,7)/16; chisq.test(x,p=ps)
X-squared = 1.8286, df = 3, p-value = 0.6087

Hence, we fail to reject $H_0$ at level 0.025. ■
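The cell probabilities in Example 4.7.2 come from integrating the hypothesized pdf over each set. As a small illustration (the variable names are our own), the R sketch below obtains them from the cdf $F(x) = x^2$, cross-checks them by numerical integration, and recomputes the statistic.

```r
# Cell probabilities for Example 4.7.2 from the cdf F(x) = x^2 of the pdf 2x on (0, 1).
breaks <- c(0, 1/4, 1/2, 3/4, 1)   # endpoints of A_1, ..., A_4
p0 <- diff(breaks^2)               # F(right) - F(left): 1/16, 3/16, 5/16, 7/16

# The same probabilities by numerical integration of the pdf, as a cross-check.
f <- function(x) 2 * x
p0_num <- sapply(1:4, function(i) integrate(f, breaks[i], breaks[i + 1])$value)

x <- c(6, 18, 20, 36)                  # observed frequencies from the example
n <- sum(x)
q <- sum((x - n * p0)^2 / (n * p0))    # realized Q_3, equal to 64/35
pval <- 1 - pchisq(q, length(x) - 1)   # p-value on k - 1 = 3 degrees of freedom
c(q = q, pval = pval)
```

The same pattern, diff applied to a cdf evaluated at the cell boundaries, works for any hypothesized continuous distribution whose cdf is available in R.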


Thus far we have used the chi-square test when the hypothesis $H_0$ is a simple
hypothesis. More often we encounter hypotheses $H_0$ in which the multinomial
probabilities $p_1, p_2, \ldots, p_k$ are not completely specified by the hypothesis $H_0$. That is,
under $H_0$, these probabilities are functions of unknown parameters. For an illustration,
suppose that a certain random variable $Y$ can take on any real value. Let us
partition the space $\{y : -\infty < y < \infty\}$ into $k$ mutually disjoint sets $A_1, A_2, \ldots, A_k$
so that the events $A_1, A_2, \ldots, A_k$ are mutually exclusive and exhaustive. Let $H_0$ be
the hypothesis that $Y$ is $N(\mu, \sigma^2)$ with $\mu$ and $\sigma^2$ unspecified. Then each

$$p_i = \int_{A_i} \frac{1}{\sqrt{2\pi}\,\sigma} \exp[-(y - \mu)^2/2\sigma^2] \, dy, \quad i = 1, 2, \ldots, k,$$

is a function of the unknown parameters $\mu$ and $\sigma^2$. Suppose that we take a random
sample $Y_1, \ldots, Y_n$ of size $n$ from this distribution. If we let $X_i$ denote the frequency
of $A_i$, $i = 1, 2, \ldots, k$, so that $X_1 + X_2 + \cdots + X_k = n$, the random variable

$$Q_{k-1} = \sum_{i=1}^{k} \frac{(X_i - np_i)^2}{np_i}$$

cannot be computed once $X_1, \ldots, X_k$ have been observed, since each $p_i$, and hence
$Q_{k-1}$, is a function of $\mu$ and $\sigma^2$. Accordingly, choose the values of $\mu$ and $\sigma^2$ that
minimize $Q_{k-1}$. These values depend upon the observed $X_1 = x_1, \ldots, X_k = x_k$ and
are called minimum chi-square estimates of $\mu$ and $\sigma^2$. These point estimates of
$\mu$ and $\sigma^2$ enable us to compute numerically the estimates of each $p_i$. Accordingly,
if these values are used, $Q_{k-1}$ can be computed once $Y_1, Y_2, \ldots, Y_n$, and hence
$X_1, X_2, \ldots, X_k$, are observed. However, a very important aspect of the fact, which
we accept without proof, is that now $Q_{k-1}$ is approximately $\chi^2(k - 3)$. That is, the
number of degrees of freedom of the approximate chi-square distribution of $Q_{k-1}$ is
reduced by one for each parameter estimated by the observed data. This statement
applies not only to the problem at hand but also to more general situations. Two
examples are now given. The first of these examples deals with the test of the
hypothesis that two multinomial distributions are the same.


Remark 4.7.1. In many cases, such as that involving the mean $\mu$ and the variance
$\sigma^2$ of a normal distribution, minimum chi-square estimates are difficult to compute.
Other estimates, such as the maximum likelihood estimates of Example 4.1.3,
$\hat{\mu} = \overline{Y}$ and $\hat{\sigma}^2 = (n-1)S^2/n$, are used to evaluate $p_i$ and $Q_{k-1}$. In general, $Q_{k-1}$
is not minimized by maximum likelihood estimates, and thus its computed value
is somewhat greater than it would be if minimum chi-square estimates are used.
Hence, when comparing it to a critical value listed in the chi-square table with $k-3$
degrees of freedom, there is a greater chance of rejection than there would be if the
actual minimum of $Q_{k-1}$ is used. Accordingly, the approximate significance level of
such a test may be higher than the p-value as calculated in the $\chi^2$-analysis. This
modification should be kept in mind and, if at all possible, each $p_i$ should be
estimated using the frequencies $X_1, \ldots, X_k$ rather than directly using the observations
$Y_1, Y_2, \ldots, Y_n$ of the random sample.
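To make the comparison in this remark concrete, the following R sketch computes both the minimum chi-square estimates (numerically, with optim) and the maximum likelihood estimates for binned normal data. Everything specific here is an assumption chosen for illustration: the simulated sample, the cut points defining the $k = 6$ cells, the seed, and the parametrization by $\log \sigma$.

```r
# Sketch: minimum chi-square versus maximum likelihood estimates for binned
# normal data. The data, cut points, and seed below are illustrative assumptions.
set.seed(471)
y <- rnorm(200, mean = 10, sd = 2)            # simulated sample Y_1, ..., Y_n
cuts <- c(-Inf, 7, 8.5, 10, 11.5, 13, Inf)    # k = 6 cells A_1, ..., A_6
x <- as.vector(table(cut(y, cuts)))           # cell frequencies X_1, ..., X_k
n <- sum(x); k <- length(x)

# Q_{k-1} as a function of theta = (mu, log sigma); the log keeps sigma positive.
qfun <- function(theta) {
  p <- diff(pnorm(cuts, mean = theta[1], sd = exp(theta[2])))
  sum((x - n * p)^2 / (n * p))
}

# Minimum chi-square estimates, starting the search at the mles.
mle <- c(mean(y), log(sqrt((n - 1) * var(y) / n)))
fit <- optim(mle, qfun)
q_min <- fit$value          # minimized statistic
q_mle <- qfun(mle)          # statistic evaluated at the mles

c(q_min = q_min, q_mle = q_mle, pval_min = 1 - pchisq(q_min, k - 3))
```

Because the search starts at the mles, the minimized value cannot exceed the value computed at the mles, which is the inequality the remark describes.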


Example 4.7.3. In this example, we consider two multinomial distributions with
parameters $n_j, p_{1j}, p_{2j}, \ldots, p_{kj}$, $j = 1, 2$, respectively. Let $X_{ij}$, $i = 1, 2, \ldots, k$,
$j = 1, 2$, represent the corresponding frequencies. If $n_1$ and $n_2$ are large and the
observations from one distribution are independent of those from the other, the
random variable

$$\sum_{j=1}^{2} \sum_{i=1}^{k} \frac{(X_{ij} - n_j p_{ij})^2}{n_j p_{ij}}$$

is the sum of two independent random variables each of which we treat as though it
were $\chi^2(k - 1)$; that is, the random variable is approximately $\chi^2(2k - 2)$. Consider
the hypothesis

$$H_0: p_{11} = p_{12}, \; p_{21} = p_{22}, \; \ldots, \; p_{k1} = p_{k2},$$

where each $p_{i1} = p_{i2}$, $i = 1, 2, \ldots, k$, is unspecified. Thus we need point estimates
of these parameters. The maximum likelihood estimator of $p_{i1} = p_{i2}$, based upon
the frequencies $X_{ij}$, is $(X_{i1} + X_{i2})/(n_1 + n_2)$, $i = 1, 2, \ldots, k$. Note that we need
only $k - 1$ point estimates, because we have a point estimate of $p_{k1} = p_{k2}$ once we
have point estimates of the first $k - 1$ probabilities. In accordance with the fact
that has been stated, the random variable

$$Q_{k-1} = \sum_{j=1}^{2} \sum_{i=1}^{k} \frac{\{X_{ij} - n_j[(X_{i1} + X_{i2})/(n_1 + n_2)]\}^2}{n_j[(X_{i1} + X_{i2})/(n_1 + n_2)]}$$

has an approximate $\chi^2$ distribution with $2k - 2 - (k - 1) = k - 1$ degrees of freedom.
Thus we are able to test the hypothesis that two multinomial distributions are the
same. For a specified level $\alpha$, the hypothesis $H_0$ is rejected when the computed
value of $Q_{k-1}$ exceeds the $1 - \alpha$ quantile of a $\chi^2$-distribution with $k - 1$ degrees of
freedom. This test is often called the chi-square test for homogeneity (the null is
equivalent to homogeneous distributions). ■
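The homogeneity statistic can be computed from its definition in a few lines. In the R sketch below the two rows of counts are made-up numbers used purely for illustration (they do not come from the text), and the result is compared with chisq.test applied to the corresponding $2 \times k$ table.

```r
# Sketch: chi-square test of homogeneity computed from its definition.
# The two rows of counts are made-up illustrative data (k = 4 categories).
x1 <- c(30, 45, 15, 10)            # frequencies from the first multinomial
x2 <- c(20, 40, 25, 15)            # frequencies from the second multinomial
n1 <- sum(x1); n2 <- sum(x2); k <- length(x1)

phat <- (x1 + x2) / (n1 + n2)      # pooled mle of the common probabilities

# Q_{k-1}: sum over both samples of (observed - expected)^2 / expected.
q <- sum((x1 - n1 * phat)^2 / (n1 * phat)) +
     sum((x2 - n2 * phat)^2 / (n2 * phat))
pval <- 1 - pchisq(q, k - 1)

# chisq.test on the 2 x k table gives the same statistic and df.
fit <- chisq.test(rbind(x1, x2))
c(by_hand = q, chisq_test = unname(fit$statistic), df = unname(fit$parameter))
```

The agreement reflects the fact that, for a table with more than two columns, chisq.test applies no continuity correction and uses exactly these pooled expected frequencies.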


The second example deals with the subject of contingency tables. 


Example 4.7.4. Let the result of a random experiment be classified by two
attributes (such as the color of the hair and the color of the eyes). That is, one
attribute of the outcome is one and only one of certain mutually exclusive and
exhaustive events, say $A_1, A_2, \ldots, A_a$; and the other attribute of the outcome is
also one and only one of certain mutually exclusive and exhaustive events, say
$B_1, B_2, \ldots, B_b$. Let $p_{ij} = P(A_i \cap B_j)$, $i = 1, 2, \ldots, a$; $j = 1, 2, \ldots, b$. The random
experiment is repeated $n$ independent times and $X_{ij}$ denotes the frequency of the
event $A_i \cap B_j$. Since there are $k = ab$ such events as $A_i \cap B_j$, the random variable

$$Q_{ab-1} = \sum_{j=1}^{b} \sum_{i=1}^{a} \frac{(X_{ij} - np_{ij})^2}{np_{ij}}$$

has an approximate chi-square distribution with $ab - 1$ degrees of freedom, provided
that $n$ is large. Suppose that we wish to test the independence of the $A$ and the $B$
attributes, i.e., the hypothesis $H_0: P(A_i \cap B_j) = P(A_i)P(B_j)$, $i = 1, 2, \ldots, a$; $j =
1, 2, \ldots, b$. Let us denote $P(A_i)$ by $p_{i\cdot}$ and $P(B_j)$ by $p_{\cdot j}$. It follows that

$$p_{i\cdot} = \sum_{j=1}^{b} p_{ij}, \quad p_{\cdot j} = \sum_{i=1}^{a} p_{ij}, \quad \text{and} \quad 1 = \sum_{j=1}^{b} \sum_{i=1}^{a} p_{ij} = \sum_{j=1}^{b} p_{\cdot j} = \sum_{i=1}^{a} p_{i\cdot}.$$

Then the hypothesis can be formulated as $H_0: p_{ij} = p_{i\cdot} p_{\cdot j}$, $i = 1, 2, \ldots, a$; $j =
1, 2, \ldots, b$. To test $H_0$, we can use $Q_{ab-1}$ with $p_{ij}$ replaced by $p_{i\cdot} p_{\cdot j}$. But if
$p_{i\cdot}$, $i = 1, 2, \ldots, a$, and $p_{\cdot j}$, $j = 1, 2, \ldots, b$, are unknown, as they frequently are
in applications, we cannot compute $Q_{ab-1}$ once the frequencies are observed. In
such a case, we estimate these unknown parameters by

$$\hat{p}_{i\cdot} = \frac{X_{i\cdot}}{n}, \quad \text{where } X_{i\cdot} = \sum_{j=1}^{b} X_{ij}, \text{ for } i = 1, 2, \ldots, a,$$

and

$$\hat{p}_{\cdot j} = \frac{X_{\cdot j}}{n}, \quad \text{where } X_{\cdot j} = \sum_{i=1}^{a} X_{ij}, \text{ for } j = 1, 2, \ldots, b.$$

Since $\sum_i p_{i\cdot} = \sum_j p_{\cdot j} = 1$, we have estimated only $a - 1 + b - 1 = a + b - 2$ parameters.
So if these estimates are used in $Q_{ab-1}$, with $p_{ij} = p_{i\cdot} p_{\cdot j}$, then, according to the
rule that has been stated in this section, the random variable

$$\sum_{j=1}^{b} \sum_{i=1}^{a} \frac{[X_{ij} - n(X_{i\cdot}/n)(X_{\cdot j}/n)]^2}{n(X_{i\cdot}/n)(X_{\cdot j}/n)}$$

has an approximate chi-square distribution with $ab - 1 - (a + b - 2) = (a - 1)(b - 1)$
degrees of freedom provided that $H_0$ is true. For a specified level $\alpha$, the hypothesis
$H_0$ is then rejected if the computed value of this statistic exceeds the $1 - \alpha$ quantile
of a $\chi^2$-distribution with $(a - 1)(b - 1)$ degrees of freedom. This is the $\chi^2$-test for
independence.

For an illustration, reconsider Example 4.1.5 in which we presented data on hair
color of Scottish children. The eye colors of the children were also recorded. The
complete data are in the following contingency table (with additionally the marginal
sums). The contingency table is also in the file scotteyehair.rda.

Eye Color | Fair | Red  | Medium | Dark | Black || Margin
Blue      | 1368 |  170 |  1041  |  398 |    1  ||  2978
Light     | 2577 |  474 |  2703  |  932 |   11  ||  6697
Medium    | 1390 |  420 |  3826  | 1842 |   33  ||  7511
Dark      |  454 |  255 |  1848  | 2506 |  112  ||  5175
Margin    | 5789 | 1319 |  9418  | 5678 |  157  || 22361

The table indicates that hair and eye color are dependent random variables. For
example, the observed frequency of children with blue eyes and black hair is 1
while the expected frequency under independence is $2978 \times 157/22361 = 20.9$. The
contribution to the test statistic from this one cell is $(1 - 20.9)^2/20.9 = 18.95$,
which nearly exceeds the test statistic's $\chi^2$ critical value at level 0.05, which is
qchisq(.95,12) = 21.026. The $\chi^2$-test statistic for independence is tedious to
compute and the reader is advised to use a statistical package. For R, assume
that the contingency table without margin sums is in the matrix scotteyehair.
Then the code chisq.test(scotteyehair) returns the $\chi^2$ test statistic and the
p-value as: X-squared = 3683.9, df = 12, p-value < 2.2e-16. Thus the result
is highly significant. Based on this study, hair color and eye color of Scottish
children are dependent on one another. To investigate where the dependence is the
strongest in a contingency table, we recommend considering the table of expected
frequencies and the table of Pearson residuals. The latter are the square roots
(with the sign of the numerators) of the summands in expression (4.7.2) defining the
test statistic. The sum of the squared Pearson residuals equals the $\chi^2$-test statistic.
In R, the following code obtains both of these items:
fit = chisq.test(scotteyehair); fit$expected; fit$residuals

Based on running this code, the largest residual is 32.8 for the cell dark hair and
dark eyes. The observed frequency is 2506 while the expected frequency under
independence is 1314. ■
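The expected frequencies and Pearson residuals can also be obtained without chisq.test, directly from the row and column totals. In the R sketch below the matrix is typed in by hand from the contingency table of this example, so its entries should be treated as an assumption to be checked against scotteyehair.rda; the object name eyehair is ours.

```r
# Sketch: expected frequencies and Pearson residuals computed from their
# definitions. The matrix is typed in from the contingency table of the
# example (rows: eye color; columns: hair color), without the margins.
eyehair <- matrix(c(1368, 170, 1041,  398,   1,
                    2577, 474, 2703,  932,  11,
                    1390, 420, 3826, 1842,  33,
                     454, 255, 1848, 2506, 112),
                  nrow = 4, byrow = TRUE)
n <- sum(eyehair)

# Under independence the expected count is (row total)(column total)/n.
expected <- outer(rowSums(eyehair), colSums(eyehair)) / n
resid <- (eyehair - expected) / sqrt(expected)   # Pearson residuals

q  <- sum(resid^2)                               # chi-square statistic
df <- (nrow(eyehair) - 1) * (ncol(eyehair) - 1)  # (a - 1)(b - 1)
c(q = q, df = df, pval = 1 - pchisq(q, df))
```

Here expected[1, 5] is the blue eyes and black hair cell discussed above, and resid[4, 4] is the dark eyes and dark hair cell.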


In each of the four examples of this section, we have indicated that the statistic
used to test the hypothesis $H_0$ has an approximate chi-square distribution, provided
that $n$ is sufficiently large and $H_0$ is true. To compute the power of any of these tests
for values of the parameters not described by $H_0$, we need the distribution of the
statistic when $H_0$ is not true. In each of these cases, the statistic has an approximate
distribution called a noncentral chi-square distribution. The noncentral
chi-square distribution is discussed later in Section 9.3.


EXERCISES 


4.7.1. Consider Example 4.7.2. Suppose the observed frequencies of $A_1, \ldots, A_4$
are 20, 30, 92, and 105, respectively. Modify the R code given in the example to
calculate the test for these new frequencies. Report the p-value.


4.7.2. A number is to be selected from the interval $\{x : 0 < x < 2\}$ by a random
process. Let $A_i = \{x : (i-1)/2 < x \leq i/2\}$, $i = 1, 2, 3$, and let $A_4 = \{x :
\frac{3}{2} < x < 2\}$. For $i = 1, 2, 3, 4$, suppose a certain hypothesis assigns probabilities
$p_{i0}$ to these sets in accordance with $p_{i0} = \int_{A_i} (\frac{1}{2})(2 - x) \, dx$, $i = 1, 2, 3, 4$. This
hypothesis (concerning the multinomial pdf with $k = 4$) is to be tested at the 5%
level of significance by a chi-square test. If the observed frequencies of the sets
$A_i$, $i = 1, 2, 3, 4$, are, respectively, 30, 30, 10, 10, would $H_0$ be accepted at the
(approximate) 5% level of significance? Use R code similar to that of Example 4.7.2
for the computation.


4.7.3. Define the sets $A_1 = \{x : -\infty < x \leq 0\}$, $A_i = \{x : i - 2 < x \leq i - 1\}$,
$i = 2, \ldots, 7$, and $A_8 = \{x : 6 < x < \infty\}$. A certain hypothesis assigns probabilities
$p_{i0}$ to these sets $A_i$ in accordance with

$$p_{i0} = \int_{A_i} \frac{1}{2\sqrt{2\pi}} \exp\left[ -\frac{(x - 3)^2}{2(4)} \right] dx, \quad i = 1, 2, \ldots, 8.$$

This hypothesis (concerning the multinomial pdf with $k = 8$) is to be tested, at the
5% level of significance, by a chi-square test. If the observed frequencies of the sets
$A_i$, $i = 1, 2, \ldots, 8$, are, respectively, 60, 96, 140, 210, 172, 160, 88, and 74, would
$H_0$ be accepted at the (approximate) 5% level of significance? Use R code similar
to that discussed in Example 4.7.2. The probabilities are easily computed in R; for
example, $p_{30}$ = pnorm(2,3,2) - pnorm(1,3,2).


4.7.4. A die was cast n = 120 independent times and the following data resulted: 


Spots Up  | 1   2   3   4   5   6
Frequency | b   20  20  20  20  40-b


If we use a chi-square test, for what values of b would the hypothesis that the die 
is unbiased be rejected at the 0.025 significance level? 


4.7.5. Consider the problem from genetics of crossing two types of peas. The
Mendelian theory states that the probabilities of the classifications (a) round and
yellow, (b) wrinkled and yellow, (c) round and green, and (d) wrinkled and green
are $\frac{9}{16}$, $\frac{3}{16}$, $\frac{3}{16}$, and $\frac{1}{16}$, respectively. If, from 160 independent observations, the
observed frequencies of these respective classifications are 86, 35, 26, and 13, are
these data consistent with the Mendelian theory? That is, test, with $\alpha = 0.01$, the
hypothesis that the respective probabilities are $\frac{9}{16}$, $\frac{3}{16}$, $\frac{3}{16}$, and $\frac{1}{16}$.


4.7.6. Two different teaching procedures were used on two different groups of
students. Each group contained 100 students of about the same ability. At the end of
the term, an evaluating team assigned a letter grade to each student. The results
were tabulated as follows.


                Grade
Group |  A   B   C   D   F | Total
I     | 15  25  32  17  11 |  100
II    |  9  18  29  28  16 |  100


If we consider these data to be independent observations from two respective
multinomial distributions with $k = 5$, test at the 5% significance level the hypothesis
that the two distributions are the same (and hence the two teaching procedures are
equally effective). For computation in R, use
r1=c(15,25,32,17,11); r2=c(9,18,29,28,16); mat=rbind(r1,r2)
chisq.test(mat)


Table 4.7.1: Contingency Table for Type of Crime and Alcoholic Status Data

Crime    | Alcoholic | Non-Alcoholic
Arson    |     50    |      43
Rape     |     88    |      62
Violence |    155    |     110
Theft    |    379    |     300
Coining  |     18    |      14
Fraud    |     63    |     144

4.7.7. Kloke and McKean (2014) present a data set concerning crime and alco- 
holism. The data they discuss is in Table 4.7.1. It contains the frequencies of 
criminals who committed certain crimes and whether or not they are alcoholics. 
The data are also in the file crimealk.rda. 


(a) Using code similar to that given in Exercise 4.7.6, compute the $\chi^2$-test for
independence between type of crime and alcoholic status. Conclude in terms
of the problem, using the p-value.


(b) Use the Pearson residuals to determine which part of the table contains the
strongest information concerning dependence.


(c) Use a $\chi^2$-test to confirm your suspicions in Part (b). This is a conditional test
based on the data, but, in practice, such tests are used for planning future
studies.


4.7.8. Let the result of a random experiment be classified as one of the mutually
exclusive and exhaustive ways $A_1, A_2, A_3$ and also as one of the mutually exclusive
and exhaustive ways $B_1, B_2, B_3, B_4$. Say that 180 independent trials of the
experiment result in the following frequencies:


7 


where $k$ is one of the integers 0, 1, 2, 3, 4, 5. What is the smallest value of $k$ that
leads to the rejection of the independence of the $A$ attribute and the $B$ attribute at
the $\alpha = 0.05$ significance level?


4.7.9. It is proposed to fit the Poisson distribution to the following data:

x         |  0   1   2   3   3 < x
Frequency | 20  40  16  18    6


(a) Compute the corresponding chi-square goodness-of-fit statistic.
Hint: In computing the mean, treat $3 < x$ as $x = 4$.


(b) How many degrees of freedom are associated with this chi-square?


(c) Do these data result in the rejection of the Poisson model at the $\alpha = 0.05$
significance level?


4.8 The Method of Monte Carlo 


In this section we introduce the concept of generating observations from a speci- 
fied distribution or sample. This is often called Monte Carlo generation. This 
technique has been used for simulating complicated processes and investigating fi- 
nite sample properties of statistical methodology for some time now. In the last 30 
years, however, this has become a very important concept in modern statistics in 
the realm of inference based on the bootstrap (resampling) and modern Bayesian 
methods. We repeatedly make use of this concept throughout the book. 

For the most part, a generator of random uniform observations is all that is 
needed. It is not easy to construct a device that generates random uniform observa- 
tions. However, there has been considerable work done in this area, not only in the 
construction of such generators, but in the testing of their accuracy as well. Most 
statistical software packages, such as R, have reliable uniform generators. 

Suppose then we have a device capable of generating a stream of independent
and identically distributed observations from a uniform (0,1) distribution. For
example, the following command generates 10 such observations in the language R:
runif(10). In this command the r stands for random, the unif stands for uniform,
the 10 stands for the number of observations requested, and the lack of additional
arguments means that the standard uniform (0,1) generator is used.

For observations from a discrete distribution, often a uniform generator suffices.
For a simple example, consider an experiment where a fair six-sided die is rolled
and the random variable $X$ is 1 if the upface is a "low number," namely $\{1, 2\}$;
otherwise, $X = 0$. Note that the mean of $X$ is $\mu = 1/3$. If $U$ has a uniform (0,1)
distribution, then $X$ can be realized as

$$X = \begin{cases} 1 & \text{if } 0 < U \leq 1/3 \\ 0 & \text{if } 1/3 < U < 1. \end{cases}$$


Using the command above, we used the following R code to generate 10 observations 
from this experiment: 


n = 10; u = runif(n); x = rep(0,n); x[u < 1/3] = 1; x 


The following table displays the results.

$u_i$ | 0.4743  0.7891  0.5550  0.9693  0.0299
$x_i$ |   0       0       0       0       1

$u_i$ | 0.8425  0.6012  0.1009  0.0545  0.4677
$x_i$ |   0       0       1       1       0

Note that the observations form a realization of a random sample $X_1, \ldots, X_{10}$ drawn
from the distribution of $X$. For these 10 observations, the realized value of the
statistic $\overline{X}$ is $\bar{x} = 0.3$.


Example 4.8.1 (Estimation of $\pi$). Consider the experiment where a pair of numbers
$(U_1, U_2)$ is chosen at random in the unit square, as shown in Figure 4.8.1; that
is, $U_1$ and $U_2$ are iid uniform (0,1) random variables. Since the point is chosen at
random, the probability of $(U_1, U_2)$ lying within the unit circle is $\pi/4$. Let $X$ be
the random variable,

$$X = \begin{cases} 1 & \text{if } U_1^2 + U_2^2 < 1 \\ 0 & \text{otherwise.} \end{cases}$$


Figure 4.8.1: Unit square with the first quadrant of the unit circle, Example 4.8.1.


Hence the mean of $X$ is $\mu = \pi/4$. Now suppose $\pi$ is unknown. One way of
estimating $\pi$ is to repeat the experiment $n$ independent times, hence, obtaining
a random sample $X_1, \ldots, X_n$ on $X$. The statistic $4\overline{X}$ is an unbiased estimator of $\pi$.
The R function piest repeats this experiment $n$ times, returning the estimate of $\pi$.
This function and other R functions discussed in this chapter are available at the
site discussed in the Preface. Figure 4.8.1 shows 20 realizations of this experiment.
Note that of the 20 points, 15 fall within the unit circle. Hence our estimate of $\pi$ is
$4(15/20) = 3.00$. We ran this code for various values of $n$ with the following results:

n                                           |  100    500    1000   10,000  100,000
$4\bar{x}$                                  |  3.24   3.072  3.132  3.1388  3.13828
$1.96 \cdot 4\sqrt{\bar{x}(1-\bar{x})/n}$   |  0.308  0.148  0.102  0.032   0.010


We can use the large sample confidence interval derived in Section 4.2 to estimate
the error of estimation. The corresponding 95% confidence interval for $\pi$ is

$$\left( 4\bar{x} - 1.96 \cdot 4\sqrt{\bar{x}(1 - \bar{x})/n}, \; 4\bar{x} + 1.96 \cdot 4\sqrt{\bar{x}(1 - \bar{x})/n} \right). \qquad (4.8.1)$$

The last row of the above table contains the error part of the confidence intervals.
Notice that all five confidence intervals trapped the true value of $\pi$.
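The experiment of Example 4.8.1 takes only a few lines of R. The function below is a sketch written for illustration; its name, piest_sketch, and its return format are our own assumptions, and it is not claimed to match the book's piest. It returns the estimate $4\bar{x}$ together with the error term of (4.8.1).

```r
# Sketch of the experiment in Example 4.8.1. The function name piest_sketch
# is ours; it is not the book's piest, whose source is not shown here.
piest_sketch <- function(n) {
  u1 <- runif(n); u2 <- runif(n)       # n points chosen at random in the unit square
  x <- as.numeric(u1^2 + u2^2 < 1)     # 1 if the point lies within the unit circle
  est <- 4 * mean(x)                   # 4 * xbar estimates pi
  err <- 1.96 * 4 * sqrt(mean(x) * (1 - mean(x)) / n)   # error term in (4.8.1)
  c(estimate = est, error = err)
}

set.seed(481)          # arbitrary seed so the run can be repeated
piest_sketch(10000)
```

Since the error term shrinks at the rate $1/\sqrt{n}$, each additional digit of accuracy requires roughly 100 times as many points.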


What about continuous random variables? For these we have the following 
theorem: 


Theorem 4.8.1. Suppose the random variable $U$ has a uniform (0,1) distribution.
Let $F$ be a continuous distribution function. Then the random variable $X = F^{-1}(U)$
has distribution function $F$.


Proof: Recall from the definition of a uniform distribution that $U$ has the
distribution function $F_U(u) = u$ for $u \in (0,1)$. Using this, the distribution-function
technique, and assuming that $F(x)$ is strictly monotone, the distribution function
of $X$ is

$$P[X \leq x] = P[F^{-1}(U) \leq x] = P[U \leq F(x)] = F(x),$$

which proves the theorem.


In the proof, we assumed that F(x) was strictly monotone. As Exercise 4.8.13 
shows, we can weaken this. 

We can use this theorem to generate realizations (observations) of many different
random variables. For example, suppose $X$ has the $\Gamma(1, \beta)$-distribution. Suppose
we have a uniform generator and we want to generate a realization of $X$. The
distribution function of $X$ is

$$F(x) = 1 - e^{-x/\beta}, \quad x > 0.$$

Hence the inverse of the distribution function is given by

$$F^{-1}(u) = -\beta \log(1 - u), \quad 0 < u < 1. \qquad (4.8.2)$$

So if $U$ has the uniform (0,1) distribution, then $X = -\beta \log(1 - U)$ has the
$\Gamma(1, \beta)$-distribution. For instance, suppose $\beta = 1$ and our uniform generator generated the
following stream of uniform observations:

0.473, 0.858, 0.501, 0.676, 0.240.

Then the corresponding stream of exponential observations is

0.641, 1.95, 0.696, 1.13, 0.274.
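The transformation (4.8.2) is a one-line function in R. In the sketch below the helper name rexp_inv is hypothetical; applied to the uniform stream above, it reproduces the exponential stream up to the rounding of the displayed uniforms.

```r
# Inverse-cdf generation of Gamma(1, beta) (exponential) observations, using (4.8.2).
rexp_inv <- function(u, beta = 1) -beta * log(1 - u)   # hypothetical helper name

u <- c(0.473, 0.858, 0.501, 0.676, 0.240)   # the uniform stream from the text
x <- rexp_inv(u, beta = 1)
round(x, 3)

# With a uniform generator in place of a fixed stream:
set.seed(482)                                # arbitrary seed
m <- mean(rexp_inv(runif(10000), beta = 3))  # should be near beta = 3
m
```

The second call illustrates that only runif is needed to generate observations from any distribution whose inverse cdf is available in closed form.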


As the next example shows, we can generate Poisson realizations using this expo- 
nential generation. 


Example 4.8.2 (Simulating Poisson Processes). Let $X$ be the number of occurrences
of an event over a unit of time and assume that it has a Poisson distribution
with mean $\lambda$, (3.2.1). Let $T_1, T_2, T_3, \ldots$ be the interarrival times of the occurrences.
Recall from Remark 3.3.1 that $T_1, T_2, T_3, \ldots$ are iid with the common $\Gamma(1, 1/\lambda)$-distribution.
Note that $X = k$ if and only if $\sum_{j=1}^{k} T_j \leq 1$ and $\sum_{j=1}^{k+1} T_j > 1$. Using
this fact and the generation of $\Gamma(1, 1/\lambda)$ variates discussed above, the following
algorithm generates a realization of $X$ (assume that the uniforms generated are
independent of one another).


1. Set $X = 0$ and $T = 0$.
2. Generate $U$ uniform (0,1) and let $Y = -(1/\lambda) \log(1 - U)$.
3. Set $T = T + Y$.
4. If $T > 1$, output $X$;
   else set $X = X + 1$ and go to step 2.


The R function poisrand provides an implementation of this algorithm, generating
$n$ simulations of a Poisson distribution with parameter $\lambda$. As an illustration, we
obtained 1000 realizations from a Poisson distribution with $\lambda = 5$ by running R
with the R code temp = poisrand(1000,5), which stores the realizations in the
vector temp. The sample average of these realizations is computed by the command
mean(temp). In the situation that we ran, the realized mean was 4.895. ■
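The four steps of the algorithm translate almost line for line into R. The function below is a sketch under our own naming (poisrand_sketch); the book's poisrand may well be coded differently.

```r
# Sketch of the algorithm in Example 4.8.2. The name poisrand_sketch is ours;
# the book's poisrand may be coded differently.
poisrand_sketch <- function(n, lambda) {
  out <- numeric(n)
  for (i in seq_len(n)) {
    x <- 0; tm <- 0                           # step 1: X = 0 and T = 0
    repeat {
      y <- -(1 / lambda) * log(1 - runif(1))  # step 2: Gamma(1, 1/lambda) variate
      tm <- tm + y                            # step 3: T = T + Y
      if (tm > 1) break                       # step 4: stop and output X
      x <- x + 1                              # otherwise count the occurrence
    }
    out[i] <- x
  }
  out
}

set.seed(483)                    # arbitrary seed
temp <- poisrand_sketch(1000, 5)
mean(temp)                       # should be near lambda = 5
```

A loop is used here for clarity rather than speed, so that each line can be matched with a step of the algorithm.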


Example 4.8.3 (Monte Carlo Integration). Suppose we want to obtain the integral
$\int_a^b g(x) \, dx$ for a continuous function $g$ over the closed and bounded interval $[a, b]$.
If an antiderivative of $g$ is not available in closed form, then numerical integration
is in order. A simple numerical technique is the method of Monte Carlo. We can
write the integral as

$$\int_a^b g(x) \, dx = (b - a) \int_a^b g(x) \frac{1}{b - a} \, dx = (b - a) E[g(X)],$$

where $X$ has the uniform $(a, b)$ distribution. The Monte Carlo technique is then to
generate a random sample $X_1, \ldots, X_n$ of size $n$ from the uniform $(a, b)$ distribution
and compute $Y_i = (b - a) g(X_i)$. Then $\overline{Y}$ is an unbiased estimator of $\int_a^b g(x) \, dx$. ■
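A generic version of this technique is short in R. In the sketch below the function name mc_integrate and the test integrand $e^{-x^2}$ on $[0, 2]$ are our own choices for illustration; the result is compared with R's deterministic integrate.

```r
# Sketch of Monte Carlo integration of g over [a, b]; mc_integrate is our name.
mc_integrate <- function(g, a, b, n) {
  x <- runif(n, min = a, max = b)   # random sample from uniform(a, b)
  y <- (b - a) * g(x)               # Y_i = (b - a) g(X_i)
  c(estimate = mean(y), se = sd(y) / sqrt(n))
}

g <- function(x) exp(-x^2)          # illustrative integrand (our choice)
set.seed(484)                       # arbitrary seed
r <- mc_integrate(g, 0, 2, 100000)  # Monte Carlo estimate and its standard error
r
exact <- integrate(g, 0, 2)$value   # deterministic quadrature, for comparison
exact
```

The standard error returned alongside the estimate gives a rough idea of how far the Monte Carlo value may be from the integral.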


Example 4.8.4 (Estimation of $\pi$ by Monte Carlo Integration). For a numerical
example, reconsider the estimation of $\pi$. Instead of the experiment described in
Example 4.8.1, we use the method of Monte Carlo integration. Let $g(x) = 4\sqrt{1 - x^2}$
for $0 < x < 1$. Then

$$\pi = \int_0^1 g(x) \, dx = E[g(X)],$$

where $X$ has the uniform (0,1) distribution. Hence we need to generate a random
sample $X_1, \ldots, X_n$ from the uniform (0,1) distribution and form $Y_i = 4\sqrt{1 - X_i^2}$.
Then $\overline{Y}$ is an unbiased estimator of $\pi$. Note that $\overline{Y}$ is estimating a mean, so the
large sample confidence interval (4.2.6) derived in Example 4.2.2 for means can be
used to estimate the error of estimation. Recall that this 95% confidence interval is
given by

$$(\bar{y} - 1.96 s/\sqrt{n}, \; \bar{y} + 1.96 s/\sqrt{n}),$$

where $s$ is the value of the sample standard deviation. We coded this algorithm
in the R function piest2. The table below gives the results for estimates of $\pi$ for
various runs of different sample sizes along with the confidence intervals.

n                              |  100       1000      10,000    100,000
$\bar{y}$                      | 3.217849  3.103322  3.135465  3.142066
$\bar{y} - 1.96(s/\sqrt{n})$   | 3.054664  3.046330  3.118080  3.136535
$\bar{y} + 1.96(s/\sqrt{n})$   | 3.381034  3.160314  3.152850  3.147597

Note that for each experiment the confidence interval trapped $\pi$. ■
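For readers who want to repeat this experiment, the following R function is one possible sketch of the computation. The name piest2_sketch and the form of the returned vector are our own assumptions; it is not the book's piest2.

```r
# Sketch of the Monte Carlo integration estimate of pi in Example 4.8.4.
# piest2_sketch is our name; the book's piest2 may differ.
piest2_sketch <- function(n) {
  x <- runif(n)                      # random sample from uniform(0, 1)
  y <- 4 * sqrt(1 - x^2)             # Y_i = 4 sqrt(1 - X_i^2)
  est <- mean(y)                     # ybar estimates pi
  half <- 1.96 * sd(y) / sqrt(n)     # half-width of the 95% interval
  c(estimate = est, lower = est - half, upper = est + half)
}

set.seed(485)            # arbitrary seed
piest2_sketch(10000)
```

For the same $n$, these intervals are shorter than those from (4.8.1), as a comparison of the two tables shows.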


Numerical integration techniques have made great strides over the last 30 years. 
But the simplicity of integration by Monte Carlo still makes it a powerful technique. 

As Theorem 4.8.1 shows, if we can obtain $F_X^{-1}(u)$ in closed form, then we can
easily generate observations with cdf $F_X$. In many cases where this is not possible,
techniques have been developed to generate observations. Note that the normal
distribution serves as an example of such a case, and, in the next example, we show
how to generate normal observations. In Section 4.8.1, we discuss an algorithm that
can be adapted for many of these cases.


Example 4.8.5 (Generating Normal Observations). To simulate normal variables, 
Box and Muller (1958) suggested the following procedure. Let $Y_1, Y_2$ be a random
sample from the uniform distribution over $0 < y < 1$. Define $X_1$ and $X_2$ by

$$X_1 = (-2\log Y_1)^{1/2}\cos(2\pi Y_2),$$
$$X_2 = (-2\log Y_1)^{1/2}\sin(2\pi Y_2).$$

This transformation is one-to-one and maps $\{(y_1, y_2) : 0 < y_1 < 1,\ 0 < y_2 < 1\}$
onto $\{(x_1, x_2) : -\infty < x_1 < \infty,\ -\infty < x_2 < \infty\}$ except for sets involving $x_1 = 0$
and $x_2 = 0$, which have probability zero. The inverse transformation is given by

$$y_1 = \exp\left\{-\frac{x_1^2 + x_2^2}{2}\right\},$$
$$y_2 = \frac{1}{2\pi}\arctan\frac{x_2}{x_1}.$$

This has the Jacobian

$$J = \begin{vmatrix} (-x_1)\exp\left\{-\frac{x_1^2+x_2^2}{2}\right\} & (-x_2)\exp\left\{-\frac{x_1^2+x_2^2}{2}\right\} \\ \dfrac{-x_2/x_1^2}{2\pi(1 + x_2^2/x_1^2)} & \dfrac{1/x_1}{2\pi(1 + x_2^2/x_1^2)} \end{vmatrix}$$
$$= \frac{-(1 + x_2^2/x_1^2)\exp\left\{-\frac{x_1^2+x_2^2}{2}\right\}}{2\pi(1 + x_2^2/x_1^2)} = \frac{-\exp\left\{-\frac{x_1^2+x_2^2}{2}\right\}}{2\pi}.$$

Since the joint pdf of $Y_1$ and $Y_2$ is 1 on $0 < y_1 < 1$, $0 < y_2 < 1$, and zero elsewhere,
the joint pdf of $X_1$ and $X_2$ is

$$\frac{\exp\left\{-\frac{x_1^2+x_2^2}{2}\right\}}{2\pi}, \quad -\infty < x_1 < \infty,\ -\infty < x_2 < \infty.$$

That is, $X_1$ and $X_2$ are independent, standard normal random variables. One of the
most commonly used normal generators is a variant of the above procedure called
the Marsaglia and Bray (1964) algorithm; see Exercise 4.8.21. ■
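The Box and Muller transformation is easy to try in R. The following is a minimal sketch; the function name box_muller and the seed are our own choices, and in practice one would simply call rnorm.

```r
# Generate n standard normal observations from pairs of uniform(0, 1) draws.
# box_muller is a hypothetical name used only for this sketch.
box_muller <- function(n) {
  m <- ceiling(n / 2)                    # each uniform pair yields two normals
  y1 <- runif(m)
  y2 <- runif(m)
  r <- sqrt(-2 * log(y1))
  x <- c(r * cos(2 * pi * y2), r * sin(2 * pi * y2))
  x[seq_len(n)]                          # keep exactly n values
}

set.seed(1958)                           # arbitrary seed
x <- box_muller(10000)
print(c(mean = mean(x), sd = sd(x)))
```

A histogram of x with the N(0,1) pdf overlaid, e.g., hist(x,pr=T) followed by curve(dnorm(x),add=T), gives a quick visual check.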


Observations from a contaminated normal distribution, discussed in Section 
3.4.1, can easily be generated using a normal generator and a uniform generator. 
We close this section by estimating via Monte Carlo the significance level of a t-test 
when the underlying distribution is a contaminated normal. 


Example 4.8.6. Let X be a random variable with mean $\mu$ and consider the hypotheses

$$H_0: \mu = 0 \ \text{ versus } \ H_1: \mu > 0.$$ (4.8.3)

Suppose we decide to base this test on a sample of size n = 20 from the distribution
of X, using the t-test with rejection rule

Reject $H_0: \mu = 0$ in favor of $H_1: \mu > 0$ if $t > t_{.05,19} = 1.729$, (4.8.4)

where $t = \overline{x}/(s/\sqrt{20})$ and $\overline{x}$ and s are the sample mean and standard deviation,
respectively. If X has a normal distribution, then this test has level 0.05. But what
if X does not have a normal distribution? In particular, for this example, suppose
X has the contaminated normal distribution given by (3.4.17) with $\epsilon = 0.25$ and
$\sigma_c = 25$; that is, 75% of the time an observation is generated by a standard normal
distribution, while 25% of the time it is generated by a normal distribution with
mean 0 and standard deviation 25. Hence the mean of X is 0, so $H_0$ is true.
To obtain the exact significance level of the test would be quite complicated. We 
would have to obtain the distribution of t when X has this contaminated normal 
distribution. As an alternative, we estimate the level (and the error of estimation) 
by simulation. Let N be the number of simulations. The following algorithm gives 
the steps of our simulation: 


1. Set k = 1, I = 0.

2. Simulate a random sample of size 20 from the distribution of X.

3. Based on this sample, compute the test statistic t.

4. If t > 1.729, increase I by 1.

5. If k = N, go to step 6; else increase k by 1 and go to step 2.

6. Compute $\hat{\alpha} = I/N$ and the approximate error $= 1.96\sqrt{\hat{\alpha}(1 - \hat{\alpha})/N}$.

Then $\hat{\alpha}$ is our simulated estimate of $\alpha$ and the half-width of a confidence interval
for $\alpha$ serves as our estimate of the error of estimation.

The R function empalphacn implements this algorithm. We ran it for N = 
10,000 obtaining the results: 


N        $\hat{\alpha}$   Error    95% CI for $\alpha$
10,000   0.0412           0.0039   (0.0373, 0.0451)


Based on these results, the t-test appears to be conservative when the sample is
drawn from this contaminated normal distribution. ■
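The six steps of the simulation translate directly into R. The sketch below is one possible coding under the stated settings ($\epsilon = 0.25$, $\sigma_c = 25$, n = 20); the names rcn_sketch and level_sim and the seed are hypothetical and are not the book's functions rcn or empalphacn.

```r
# Contaminated normal draws: N(0, 1) with probability 1 - eps and
# N(0, sigma_c^2) with probability eps. rcn_sketch is a hypothetical name.
rcn_sketch <- function(n, eps, sigma_c) {
  contaminated <- rbinom(n, 1, eps)
  z <- rnorm(n)
  ifelse(contaminated == 1, sigma_c * z, z)
}

# Steps 1-6 of the simulation algorithm; level_sim is a hypothetical name.
level_sim <- function(N, n = 20, eps = 0.25, sigma_c = 25, alpha = 0.05) {
  crit <- qt(1 - alpha, n - 1)               # about 1.729 when n = 20
  rejections <- 0
  for (k in seq_len(N)) {
    x <- rcn_sketch(n, eps, sigma_c)         # step 2
    t_stat <- mean(x) / (sd(x) / sqrt(n))    # step 3
    if (t_stat > crit) rejections <- rejections + 1   # step 4
  }
  alpha_hat <- rejections / N                # step 6
  err <- 1.96 * sqrt(alpha_hat * (1 - alpha_hat) / N)
  c(alpha_hat = alpha_hat, error = err)
}

set.seed(486)                                # arbitrary seed
out <- level_sim(10000)
print(out)
```

Different seeds give slightly different estimates, but they should fall within a few multiples of the reported error of one another.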


4.8.1 Accept-Reject Generation Algorithm


In this section, we develop the accept-reject procedure that can often be used to
simulate random variables whose inverse cdf cannot be obtained in closed form. Let
X be a continuous random variable with pdf f(x). For this discussion, we call this
pdf the target pdf. Suppose it is relatively easy to generate an observation of the
random variable Y which has pdf g(x) and that for some constant M we have

$$f(x) \le M g(x), \quad -\infty < x < \infty.$$ (4.8.5)

We call g(x) the instrumental pdf. For clarity, we write the accept-reject as an
algorithm:


Algorithm 4.8.1 (Accept-Reject Algorithm). Let f(x) be a pdf. Suppose that Y
is a random variable with pdf g(y), U is a random variable with a uniform(0, 1)
distribution, Y and U are independent, and (4.8.5) holds. The following algorithm
generates a random variable X with pdf f(x).

1. Generate Y and U.

2. If $U \le \dfrac{f(Y)}{M g(Y)}$, then take X = Y. Otherwise return to step 1.

3. X has pdf f(x).
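Proof of the validity of the algorithm: Let $-\infty < x < \infty$. Then

$$P[X \le x] = P\left[Y \le x \,\Big|\, U \le \frac{f(Y)}{M g(Y)}\right]$$
$$= \frac{\int_{-\infty}^{x}\left[\int_0^{f(y)/Mg(y)} du\right] g(y)\,dy}{\int_{-\infty}^{\infty}\left[\int_0^{f(y)/Mg(y)} du\right] g(y)\,dy}$$
$$= \frac{\int_{-\infty}^{x} \frac{f(y)}{M g(y)}\, g(y)\,dy}{\int_{-\infty}^{\infty} \frac{f(y)}{M g(y)}\, g(y)\,dy}$$
$$= \int_{-\infty}^{x} f(y)\,dy.$$ (4.8.7)

Hence, by differentiating both sides, we find that the pdf of X is f(x). ■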


There are two facts worth noting. First, the probability of an acceptance in the
algorithm is 1/M. This can be seen in the derivation in the proof of the algorithm.
Just consider the denominators in the derivation which show that

$$P\left[U \le \frac{f(Y)}{M g(Y)}\right] = \frac{1}{M}.$$ (4.8.8)

Hence, for efficiency of the algorithm we want M as small as possible. Secondly,
normalizing constants of the two pdfs f(x) and g(x) can be ignored. For example,
if f(x) = kh(x) and g(x) = ct(x) for constants c and k, then we can use the rule

$$h(x) \le M_2 t(x), \quad -\infty < x < \infty,$$ (4.8.9)

and change the ratio in step 2 of the algorithm to $U \le h(Y)/[M_2 t(Y)]$. It follows
directly that expression (4.8.5) holds if and only if expression (4.8.9) holds where
$M_2 = cM/k$. This often simplifies the use of the accept-reject algorithm.

We next present two examples of the accept-reject algorithm. The first example
offers a normal generator where the instrumental random variable, Y, has a
Cauchy distribution. The second example shows how all gamma distributions can
be generated.


Example 4.8.7. Suppose that X is a normally distributed random variable with
pdf $\phi(x) = (2\pi)^{-1/2}\exp\{-x^2/2\}$ and Y has a Cauchy distribution with pdf $g(x) = \pi^{-1}(1 + x^2)^{-1}$.
As Exercise 4.8.9 shows, the Cauchy distribution is easy to simulate
because its inverse cdf is a known function. Ignoring normalizing constants, the
ratio to bound is

$$\frac{f(x)}{g(x)} \propto (1 + x^2)\exp\{-x^2/2\}, \quad -\infty < x < \infty.$$ (4.8.10)

As Exercise 4.8.17 shows, the derivative of this ratio is $-x\exp\{-x^2/2\}(x^2 - 1)$,
which has critical values at $\pm 1$. These values provide maxima to (4.8.10). Hence,

$$(1 + x^2)\exp\{-x^2/2\} \le 2\exp\{-1/2\} = 1.213,$$

so $M_2 = 1.213$. Hence, from the above discussion, $M = (\pi/\sqrt{2\pi})\,1.213 = 1.520$.
Hence, the acceptance rate of the algorithm is 1/M = 0.6577. ■
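As a rough check on this example, the following R sketch runs the accept-reject algorithm with the Cauchy instrumental pdf and the bound $M_2 = 2\exp\{-1/2\}$. The name rnorm_ar and the seed are our own; the Cauchy draws use the inverse cdf $F^{-1}(u) = \tan(\pi(u - 1/2))$, which is an assumption of this sketch in the sense that the text leaves its derivation to Exercise 4.8.9.

```r
# Accept-reject generation of N(0, 1) variates with a Cauchy instrumental pdf.
# rnorm_ar is a hypothetical name used only for this sketch.
rnorm_ar <- function(n) {
  M2 <- 2 * exp(-1 / 2)                 # bound on (1 + x^2) exp(-x^2 / 2)
  out <- numeric(n)
  accepted <- 0
  proposals <- 0
  while (accepted < n) {
    y <- tan(pi * (runif(1) - 1 / 2))   # step 1: Cauchy draw via inverse cdf
    u <- runif(1)
    proposals <- proposals + 1
    if (u <= (1 + y^2) * exp(-y^2 / 2) / M2) {   # step 2: h(Y) / [M2 t(Y)]
      accepted <- accepted + 1
      out[accepted] <- y
    }
  }
  list(x = out, rate = n / proposals)   # rate estimates 1/M
}

set.seed(487)                           # arbitrary seed
res <- rnorm_ar(10000)
print(c(mean = mean(res$x), sd = sd(res$x), rate = res$rate))
```

The observed acceptance rate should be close to the theoretical value 1/M = 0.6577.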


Example 4.8.8. Suppose we want to generate observations from a $\Gamma(\alpha, \beta)$. First,
if Y has a $\Gamma(\alpha, 1)$-distribution then $\beta Y$ has a $\Gamma(\alpha, \beta)$-distribution. Hence, we need
only consider $\Gamma(\alpha, 1)$ distributions. So let X have a $\Gamma(\alpha, 1)$-distribution. If $\alpha$ is a
positive integer then by Theorem 3.3.1 we can write X as

$$X = T_1 + T_2 + \cdots + T_\alpha,$$

where $T_1, T_2, \ldots, T_\alpha$ are independent and identically distributed with the common
$\Gamma(1, 1)$-distribution. In the discussion around expression (4.8.2), we have shown how
to generate $T_i$.
Assume then that X has a $\Gamma(\alpha, 1)$ distribution, where $\alpha$ is not an integer. Assume
first that $\alpha > 1$. Let Y have a $\Gamma([\alpha], 1/b)$ distribution, where b < 1 is chosen
later and, as usual, $[\alpha]$ means the greatest integer less than or equal to $\alpha$. To
establish rule (4.8.9), consider the ratio, with h(x) and t(x) proportional to the pdfs
of X and Y, respectively, given by

$$\frac{h(x)}{t(x)} = b^{-[\alpha]}\, x^{\alpha - [\alpha]}\, e^{-(1-b)x},$$ (4.8.11)

where we have ignored some of the normalizing constants. We next determine the
constant b.
As Exercise 4.8.14 shows, the derivative of expression (4.8.11) is

$$\frac{d}{dx}\, b^{-[\alpha]}\, x^{\alpha - [\alpha]}\, e^{-(1-b)x} = b^{-[\alpha]}\, e^{-(1-b)x}\,[(\alpha - [\alpha]) - x(1 - b)]\, x^{\alpha - [\alpha] - 1},$$ (4.8.12)

which has a maximum critical value at $x = (\alpha - [\alpha])/(1 - b)$. Hence, using the
maximum of h(x)/t(x),

$$\frac{h(x)}{t(x)} \le b^{-[\alpha]} \left[\frac{\alpha - [\alpha]}{(1 - b)e}\right]^{\alpha - [\alpha]}.$$ (4.8.13)

Now, we need to find our choice of b. Differentiating the right side of this inequality
with respect to b, we get, as Exercise 4.8.15 shows,

$$\frac{d}{db}\, b^{-[\alpha]}(1 - b)^{[\alpha] - \alpha} = -b^{-[\alpha]}(1 - b)^{[\alpha] - \alpha}\, \frac{[\alpha] - \alpha b}{b(1 - b)},$$ (4.8.14)

which has a critical value at $b = [\alpha]/\alpha < 1$. As shown in that exercise, this value
of b provides a minimum of the right side of expression (4.8.13). Thus, if we take
$b = [\alpha]/\alpha < 1$, then inequality (4.8.13) holds and it is the tightest inequality possible
and, hence, provides the highest acceptance rate. The final value of M is the right
side of expression (4.8.13) evaluated at $b = [\alpha]/\alpha < 1$.

What if $0 < \alpha < 1$? Then the above argument does not work. In this case
write $X = YU^{1/\alpha}$ where Y has a $\Gamma(\alpha + 1, 1)$-distribution, U has a uniform (0, 1)-
distribution, and Y and U are independent. Then, as the derivation in Exercise
4.8.16 shows, X has a $\Gamma(\alpha, 1)$-distribution and we are finished.

For further discussion, see Kennedy and Gentle (1980) and Robert and Casella
(1999). ■
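For the case $\alpha > 1$ with $\alpha$ not an integer, the construction of this example can be sketched in R as follows. The instrumental variable is a sum of $[\alpha]$ exponentials with scale 1/b, where $b = [\alpha]/\alpha$, and the acceptance ratio is $h(y)/[M_2 t(y)]$ with $M_2$ the right side of (4.8.13). The name rgamma_ar and the seed are hypothetical; in practice R's rgamma would be used.

```r
# Accept-reject generation of Gamma(alpha, 1) variates, alpha > 1 noninteger.
# rgamma_ar is a hypothetical name used only for this sketch.
rgamma_ar <- function(n, alpha) {
  stopifnot(alpha > 1, alpha != floor(alpha))
  k <- floor(alpha)                      # [alpha]
  d <- alpha - k                         # alpha - [alpha]
  b <- k / alpha                         # choice of b that minimizes the bound
  out <- numeric(n)
  filled <- 0
  while (filled < n) {
    y <- -sum(log(runif(k))) / b         # Gamma([alpha], 1/b) as a sum of exponentials
    u <- runif(1)
    # h(y) / [M2 t(y)]: the factors b^(-[alpha]) cancel, leaving this ratio,
    # which equals 1 at its maximizing point y = d / (1 - b).
    ratio <- (y * (1 - b) * exp(1) / d)^d * exp(-(1 - b) * y)
    if (u <= ratio) {
      filled <- filled + 1
      out[filled] <- y
    }
  }
  out
}

set.seed(488)                            # arbitrary seed
x <- rgamma_ar(5000, 2.5)
print(c(mean = mean(x), var = var(x)))   # both should be near alpha = 2.5
```

Multiplying the output by $\beta$ then gives observations from a $\Gamma(\alpha, \beta)$ distribution, as noted at the start of the example.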


EXERCISES 


4.8.1. Prove the converse of Theorem MCT. That is, let X be a random variable
with a continuous cdf F(x). Assume that F(x) is strictly increasing on the space
of X. Consider the random variable Z = F(X). Show that Z has a uniform
distribution on the interval (0, 1).

4.8.2. Recall that $\log 2 = \int_0^1 \frac{1}{x+1}\,dx$. Hence, by using a uniform(0,1) generator,
approximate log 2. Obtain an error of estimation in terms of a large sample 95%
confidence interval. Write an R function for the estimate and the error of estimation.
Obtain your estimate for 10,000 simulations and compare it to the true value.


4.8.3. Similar to Exercise 4.8.2 but now approximate $\int \frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2}t^2\right\}dt$.

4.8.4. Suppose X is a random variable with the pdf $f_X(x) = b^{-1}f((x - a)/b)$,
where b > 0. Suppose we can generate observations from f(z). Explain how we can
generate observations from $f_X(x)$.


4.8.5. Determine a method to generate random observations for the logistic pdf,
(4.4.11). Write an R function that returns a random sample of observations from
a logistic distribution. Use your function to generate 10,000 observations from this
pdf. Then obtain a histogram (use hist(x,pr=T), where x contains the observations).
On this histogram overlay a plot of the pdf.


4.8.6. Determine a method to generate random observations for the following pdf: 


$$f(x) = \begin{cases} 4x^3 & 0 < x < 1 \\ 0 & \text{elsewhere.} \end{cases}$$


Write an R function that returns a random sample of observations from this pdf. 


4.8.7. Obtain the inverse function of the cdf of the Laplace pdf, given by $f(x) = (1/2)e^{-|x|}$,
for $-\infty < x < \infty$. Write an R function that returns a random sample
of observations from this distribution.


4.8.8. Determine a method to generate random observations for the extreme-valued
pdf that is given by

$$f(x) = \exp\{x - e^x\}, \quad -\infty < x < \infty.$$ (4.8.15)

Write an R function that returns a random sample of observations from an
extreme-valued distribution. Use your function to generate 10,000 observations from this pdf.
Then obtain a histogram (use hist(x,pr=T), where x contains the observations).
On the histogram overlay a plot of the pdf.


4.8.9. Determine a method to generate random observations for the Cauchy
distribution with pdf

$$f(x) = \frac{1}{\pi(1 + x^2)}, \quad -\infty < x < \infty.$$ (4.8.16)

Write an R function that returns a random sample of observations from this Cauchy
distribution.


4.8.10. Suppose we are interested in a particular Weibull distribution with pdf

$$f(x) = \begin{cases} \dfrac{1}{\theta^3}\, 3x^2 e^{-x^3/\theta^3} & 0 < x < \infty \\ 0 & \text{elsewhere.} \end{cases}$$

Determine a method to generate random observations from this Weibull distribution.
Write an R function that returns such a sample.
Hint: Find $F^{-1}(u)$.

4.8.11. Consider the situation in Example 4.8.6 with the hypotheses (4.8.3). Write
an algorithm that simulates the power of the test (4.8.4) to detect the alternative
$\mu = 0.5$ under the same contaminated normal distribution as in the example. Modify
the R function empalphacn(N) to simulate this power and to obtain an estimate of
the error of estimation.


4.8.12. For the last exercise, write an algorithm to simulate the significance level
and power to detect the alternative $\mu = 0.5$ for the test (4.8.4) when the underlying
distribution is the logistic distribution (4.4.11).


4.8.13. For the proof of Theorem 4.8.1, we assumed that the cdf was strictly
increasing over its support. Consider a random variable X with cdf F(x) that is not
strictly increasing. Define as the inverse of F(x) the function

$$F^{-1}(u) = \inf\{x : F(x) \ge u\}, \quad 0 < u < 1.$$

Let U have a uniform (0,1) distribution. Prove that the random variable $F^{-1}(U)$
has cdf F(x).


4.8.14. Verify the derivative in expression (4.8.12) and show that the function
(4.8.11) attains a maximum at the critical value $x = (\alpha - [\alpha])/(1 - b)$.


4.8.15. Derive expression (4.8.14) and show that the resulting critical value
$b = [\alpha]/\alpha < 1$ gives a minimum of the function that is the right side of expression
(4.8.13).


4.8.16. Assume that $Y_1$ has a $\Gamma(\alpha + 1, 1)$-distribution, $Y_2$ has a uniform (0,1)
distribution, and $Y_1$ and $Y_2$ are independent. Consider the transformation
$X_1 = Y_1 Y_2^{1/\alpha}$ and $X_2 = Y_2$.


(a) Show that the inverse transformation is: $y_1 = x_1/x_2^{1/\alpha}$ and $y_2 = x_2$ with
support $0 < x_1 < \infty$ and $0 < x_2 < 1$.


(b) Show that the Jacobian of the transformation is $1/x_2^{1/\alpha}$ and the pdf of $(X_1, X_2)$
is

$$f(x_1, x_2) = \frac{1}{\Gamma(\alpha + 1)}\, \frac{x_1^{\alpha}}{x_2^{1/\alpha + 1}} \exp\left\{-\frac{x_1}{x_2^{1/\alpha}}\right\}, \quad 0 < x_1 < \infty \text{ and } 0 < x_2 < 1.$$


(c) Show that the marginal distribution of $X_1$ is $\Gamma(\alpha, 1)$.


4.8.17. Show that the derivative of the ratio in expression (4.8.10) is given by the
function $-x\exp\{-x^2/2\}(x^2 - 1)$ with critical values $\pm 1$. Show that the critical
values provide maxima for expression (4.8.10).


4.8.18. Consider the pdf

$$f(x) = \begin{cases} \beta x^{\beta - 1} & 0 < x < 1 \\ 0 & \text{elsewhere,} \end{cases}$$

for $\beta > 1$.

(a) Use Theorem 4.8.1 to generate an observation from this pdf.
(b) Use the accept-reject algorithm to generate an observation from this pdf.


4.8.19. Proceeding similar to Example 4.8.7, use the accept-reject algorithm to
generate an observation from a t distribution with r > 1 degrees of freedom when
g(x) is the Cauchy pdf.


4.8.20. For $\alpha > 0$ and $\beta > 0$, consider the following accept-reject algorithm:


1. Generate $U_1$ and $U_2$ iid uniform(0,1) random variables. Set $V_1 = U_1^{1/\alpha}$ and
$V_2 = U_2^{1/\beta}$.
2. Set $W = V_1 + V_2$. If $W \le 1$, set $X = V_1/W$; else go to step 1.
3. Deliver X.


Show that X has a beta distribution with parameters $\alpha$ and $\beta$, (3.3.9). See Kennedy
and Gentle (1980).


4.8.21. Consider the following algorithm:
1. Generate U and V independent uniform (−1,1) random variables.
2. Set $W = U^2 + V^2$.
3. If W > 1 go to step 1.
4. Set $Z = \sqrt{(-2\log W)/W}$ and let $X_1 = UZ$ and $X_2 = VZ$.


Show that the random variables $X_1$ and $X_2$ are iid with a common N(0, 1) distribution.
This algorithm was proposed by Marsaglia and Bray (1964).


4.9 Bootstrap Procedures 


In the last section, we introduced the method of Monte Carlo and discussed several 
of its applications. In the last few years, however, Monte Carlo procedures have 
become increasingly used in statistical inference. In this section, we present the 
bootstrap, one of these procedures. We concentrate on confidence intervals and 
tests for one- and two-sample problems in this section. 


4.9.1 Percentile Bootstrap Confidence Intervals 


Let X be a random variable of the continuous type with pdf $f(x; \theta)$, for $\theta \in \Omega$.
Suppose $\mathbf{X} = (X_1, X_2, \ldots, X_n)$ is a random sample on X and $\hat{\theta} = \hat{\theta}(\mathbf{X})$ is a point
estimator of $\theta$. The vector notation, $\mathbf{X}$, proves useful in this section. In Sections 4.2
and 4.3, we discussed the problem of obtaining confidence intervals for $\theta$ in certain
situations. In this section, we discuss a general method called the percentile
bootstrap procedure, which is a resampling procedure. It was proposed by Efron (1979).
Informative discussions of such procedures can be found in Efron and Tibshirani
(1993) and Davison and Hinkley (1997).
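To motivate the procedure, suppose for the moment that

$$\hat{\theta} \text{ has a } N(\theta, \sigma_{\hat{\theta}}^2) \text{ distribution.}$$ (4.9.1)

Then as in Section 4.2, a $(1 - \alpha)100\%$ confidence interval for $\theta$ is $(\hat{\theta}_L, \hat{\theta}_U)$, where

$$\hat{\theta}_L = \hat{\theta} - z^{(1-\alpha/2)}\sigma_{\hat{\theta}} \quad \text{and} \quad \hat{\theta}_U = \hat{\theta} - z^{(\alpha/2)}\sigma_{\hat{\theta}},$$ (4.9.2)

and $z^{(\gamma)}$ denotes the $\gamma 100$th percentile of a standard normal random variable; i.e.,
$z^{(\gamma)} = \Phi^{-1}(\gamma)$, where $\Phi$ is the cdf of a N(0,1) random variable (see also Exercise
4.9.5). We have gone to a superscript notation here to avoid confusion with the
usual subscript notation on critical values.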
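Now suppose that $\hat{\theta}$ and $\sigma_{\hat{\theta}}$ are realizations from the sample and $\hat{\theta}_L$ and $\hat{\theta}_U$ are
calculated as in (4.9.2). Next suppose that $\hat{\theta}^*$ is a random variable with a $N(\hat{\theta}, \sigma_{\hat{\theta}}^2)$
distribution. Then, by (4.9.2),

$$P(\hat{\theta}^* \le \hat{\theta}_L) = P\left(\frac{\hat{\theta}^* - \hat{\theta}}{\sigma_{\hat{\theta}}} \le -z^{(1-\alpha/2)}\right) = \alpha/2.$$ (4.9.3)

Likewise, $P(\hat{\theta}^* \le \hat{\theta}_U) = 1 - (\alpha/2)$. Therefore, $\hat{\theta}_L$ and $\hat{\theta}_U$ are the $\frac{\alpha}{2}100$th and
$(1 - \frac{\alpha}{2})100$th percentiles of the distribution of $\hat{\theta}^*$. That is, the percentiles of the
$N(\hat{\theta}, \sigma_{\hat{\theta}}^2)$ distribution form the $(1 - \alpha)100\%$ confidence interval for $\theta$.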

We want our final procedure to be quite general, so the normality assumption
(4.9.1) is definitely not desired and, in Remark 4.9.1, we do show that this assumption
is not necessary. So, in general, let H(t) denote the cdf of $\hat{\theta}$.

In practice, though, we do not know the function H(t). Hence the above confidence
interval defined by statement (4.9.3) cannot be obtained. But suppose we
could take an infinite number of samples $\mathbf{X}_1, \mathbf{X}_2, \ldots$; obtain $\hat{\theta}^* = \hat{\theta}(\mathbf{X}^*)$ for each
sample $\mathbf{X}^*$; and then form the histogram of these estimates $\hat{\theta}^*$. The percentiles
of this histogram would be the confidence interval defined by expression (4.9.3).
Since we only have one sample, this is impossible. It is, however, the idea behind
bootstrap procedures.

Bootstrap procedures simply resample from the empirical distribution defined
by the one sample. The sampling is done at random and with replacement and
the resamples are all of size n, the size of the original sample. That is, suppose
$\mathbf{x}' = (x_1, x_2, \ldots, x_n)$ denotes the realization of the sample. Let $\hat{F}_n$ denote the
empirical distribution function of the sample. Recall that $\hat{F}_n$ is a discrete cdf that
puts mass $n^{-1}$ at each point $x_i$ and that $\hat{F}_n(x)$ is an estimator of F(x). Then a
bootstrap sample is a random sample, say $\mathbf{x}^{*\prime} = (x_1^*, x_2^*, \ldots, x_n^*)$, drawn from $\hat{F}_n$.
For example, it follows from the definition of expectation that

$$E(x_i^*) = \sum_{i=1}^{n} x_i \frac{1}{n} = \overline{x}.$$ (4.9.4)

Likewise $V(x_i^*) = n^{-1}\sum_{i=1}^{n}(x_i - \overline{x})^2$; see Exercise 4.9.2. At first glance, this
resampling the sample seems like it would not work. But our only information on
sampling variability is within the sample itself, and by resampling the sample we
are simulating this variability.

We now give an algorithm that obtains a bootstrap confidence interval. For
clarity, we present a formal algorithm, which can be readily coded into languages
such as R. Let $\mathbf{x}' = (x_1, x_2, \ldots, x_n)$ be the realization of a random sample drawn
from a cdf $F(x; \theta)$, $\theta \in \Omega$. Let $\hat{\theta}$ be a point estimator of $\theta$. Let B, an integer, denote
the number of bootstrap replications, i.e., the number of resamples. In practice, B
is often 3000 or more.


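1. Set j = 1.

2. While $j \le B$, do steps 2-5.

3. Let $\mathbf{x}_j^*$ be a random sample of size n drawn from the sample $\mathbf{x}$. That is, the
observations $\mathbf{x}_j^*$ are drawn at random from $x_1, x_2, \ldots, x_n$, with replacement.

4. Let $\hat{\theta}_j^* = \hat{\theta}(\mathbf{x}_j^*)$.

5. Replace j by j + 1.

6. Let $\hat{\theta}_{(1)}^* \le \hat{\theta}_{(2)}^* \le \cdots \le \hat{\theta}_{(B)}^*$ denote the ordered values of $\hat{\theta}_1^*, \hat{\theta}_2^*, \ldots, \hat{\theta}_B^*$. Let
$m = [(\alpha/2)B]$, where $[\cdot]$ denotes the greatest integer function. Form the
interval

$$(\hat{\theta}_{(m)}^*,\ \hat{\theta}_{(B+1-m)}^*);$$ (4.9.5)

that is, obtain the $\frac{\alpha}{2}100\%$ and $(1 - \frac{\alpha}{2})100\%$ percentiles of the sampling
distribution of $\hat{\theta}_1^*, \hat{\theta}_2^*, \ldots, \hat{\theta}_B^*$.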


The interval in (4.9.5) is called the percentile bootstrap confidence interval for
$\theta$. In step 6, the subscripted parenthetical notation is a common notation for order
statistics (Section 4.4), which is handy in this section.

For the remainder of this subsection, we use as our estimator of $\theta$ the sample
mean. For the sample mean, the following R function percentciboot is an R
implementation of this algorithm (it can be downloaded at the site listed in Chapter
1):

percentciboot <- function(x,b,alpha){
  theta=mean(x); thetastar=rep(0,b); n=length(x)
  for(i in 1:b){xstar=sample(x,n,replace=T)
    thetastar[i]=mean(xstar)}
  thetastar=sort(thetastar); pick=round((alpha/2)*(b+1))
  lower=thetastar[pick]; upper=thetastar[b-pick+1]
  list(theta=theta,lower=lower,upper=upper)}
  #list(theta=theta,lower=lower,upper=upper,thetasta=thetastar)}


The input consists of the sample x, the number of bootstraps b, and the desired
confidence coefficient alpha. The second line of code computes the mean and the
size of the sample and provides a vector to store the $\hat{\theta}^*$s. In the for loop, the
ith bootstrap sample is obtained by the single command sample(x,n,replace=T),
which is followed by the computation of $\hat{\theta}_i^*$. The remainder of the code forms the
bootstrap confidence interval, while the list command returns the estimate and
the bootstrap confidence interval. The optional second list command returns the
$\hat{\theta}^*$s, also. Notice that it is easy to change the code for an estimator other than the
mean. For example, to obtain a bootstrap confidence interval for the median just
replace the two occurrences of mean with median. We illustrate this discussion in
the next example.


Example 4.9.1. In this example, we sample from a known distribution, but, in
practice, the distribution is usually unknown. Let $X_1, X_2, \ldots, X_n$ be a random
sample from a $\Gamma(1, \beta)$ distribution. Since the mean of this distribution is $\beta$, the
sample average $\overline{X}$ is an unbiased estimator of $\beta$. In this example, $\overline{X}$ serves as
our point estimator of $\beta$. The following 20 data points are the realizations (rounded)
of a random sample of size n = 20 from a $\Gamma(1, 100)$ distribution:


131.7  182.7   73.3   10.7  150.4   42.3   22.2   17.9  264.0  154.4
  4.3  265.6   61.9   10.8   48.8   22.5    8.8  150.6  103.0   85.9


The value of $\overline{X}$ for this sample is $\overline{x} = 90.59$, which is our point estimate of $\beta$.
For illustration, we generated one bootstrap sample of these data. This ordered
bootstrap sample is


  4.3    4.3    4.3   10.8   10.8   10.8   10.8   17.9   22.5   42.3
 48.8   48.8   85.9  131.7  131.7  150.4  154.4  154.4  264.0  265.6


The sample mean of this particular bootstrap sample is $\overline{x}^* = 78.725$. To obtain
our bootstrap confidence interval for $\beta$, we need to compute many more resamples.
For this computation, we used the R function percentciboot discussed
above. Let x denote the R vector of the original sample of observations. We
selected 3000 as the number of bootstraps and chose $\alpha = 0.10$. We used the code
percentciboot(x,3000,.10) to compute our bootstrap confidence interval. Figure
4.9.1 displays a histogram of the 3000 sample means $\overline{x}^*$s computed by the code.
The sample mean of these 3000 values is 90.13, close to $\overline{x} = 90.59$. Our program also
obtained a 90% (bootstrap percentile) confidence interval given by (61.655, 120.48),
which the reader can locate on the figure. It does trap the true value $\mu = 100$.
Exercise 4.9.3 shows that if we are sampling from a $\Gamma(1, \beta)$ distribution, then the
interval $(2n\overline{X}/[\chi_{2n}^2]^{(1-(\alpha/2))},\ 2n\overline{X}/[\chi_{2n}^2]^{(\alpha/2)})$ is an exact $(1 - \alpha)100\%$ confidence
interval for $\beta$. Note that, in keeping with our superscript notation for critical values,
$[\chi_{2n}^2]^{(\gamma)}$ denotes the $\gamma 100\%$ percentile of a $\chi^2$ distribution with 2n degrees of freedom.
This exact 90% confidence interval for our sample is (64.99, 136.69). ■
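The numbers in this example can be checked with a short R session. The sketch below enters the 20 observations, computes the exact interval with qchisq, and runs a percentile bootstrap that follows step 6 of the algorithm (with $m = [(\alpha/2)B]$). The helper name boot_percentile and the seed are our own, so the bootstrap endpoints will differ somewhat from those reported above.

```r
# Data of Example 4.9.1.
x <- c(131.7, 182.7, 73.3, 10.7, 150.4, 42.3, 22.2, 17.9, 264.0, 154.4,
       4.3, 265.6, 61.9, 10.8, 48.8, 22.5, 8.8, 150.6, 103.0, 85.9)
n <- length(x)
alpha <- 0.10
xbar <- mean(x)

# Exact interval based on the chi-square distribution with 2n degrees of freedom.
exact <- c(lower = 2 * n * xbar / qchisq(1 - alpha / 2, 2 * n),
           upper = 2 * n * xbar / qchisq(alpha / 2, 2 * n))

# Percentile bootstrap interval for a general statistic.
# boot_percentile is a hypothetical helper used only for this sketch.
boot_percentile <- function(x, B, alpha, stat = mean) {
  thetastar <- sort(replicate(B, stat(sample(x, length(x), replace = TRUE))))
  m <- floor((alpha / 2) * B)
  c(lower = thetastar[m], upper = thetastar[B + 1 - m])
}

set.seed(491)                            # arbitrary seed
boot <- boot_percentile(x, 3000, alpha)
print(xbar)
print(round(exact, 2))
print(boot)
```

Passing stat = median to boot_percentile gives the corresponding interval for the median.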


What about the validity of a bootstrap confidence interval? Davison and Hinkley
(1997) discuss the theory behind the bootstrap in Chapter 2 of their book.
Under some general conditions, they show that the bootstrap confidence interval is
asymptotically valid.

Figure 4.9.1: Histogram of the 3000 bootstrap $\overline{x}^*$s. The 90% bootstrap confidence
interval is (61.655, 120.48).

One way of improving the bootstrap is to use a pivot random variable, a variable
whose distribution is free of other parameters. For instance, in the last example,
instead of using $\overline{X}$, use $\overline{X}/\hat{\sigma}_{\overline{X}}$, where $\hat{\sigma}_{\overline{X}} = S/\sqrt{n}$ and $S = [\sum(X_i - \overline{X})^2/(n - 1)]^{1/2}$;
that is, adjust $\overline{X}$ by its standard error. This is discussed in Exercise 4.9.6. Other
improvements are discussed in the two books cited earlier.


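Remark 4.9.1. Briefly, we show that the normal assumption on the distribution
of $\hat{\theta}$, (4.9.1), is transparent to the argument around expression (4.9.3); see Efron
and Tibshirani (1993) for further discussion. Suppose H is the cdf of $\hat{\theta}$ and that H
depends on $\theta$. Then, using Theorem 4.8.1, we can find an increasing transformation
$\phi = m(\theta)$ such that the distribution of $\hat{\phi} = m(\hat{\theta})$ is $N(\phi, \sigma_c^2)$, where $\phi = m(\theta)$
and $\sigma_c^2$ is some variance. For example, take the transformation to be $m(\theta) = F_c^{-1}(H(\theta))$,
where $F_c(x)$ is the cdf of a $N(\phi, \sigma_c^2)$ distribution. Then, as above,
$(\hat{\phi} - z^{(1-\alpha/2)}\sigma_c,\ \hat{\phi} - z^{(\alpha/2)}\sigma_c)$ is a $(1 - \alpha)100\%$ confidence interval for $\phi$. But note
that

$$1 - \alpha = P\left[\hat{\phi} - z^{(1-\alpha/2)}\sigma_c < \phi < \hat{\phi} - z^{(\alpha/2)}\sigma_c\right]$$
$$= P\left[m^{-1}(\hat{\phi} - z^{(1-\alpha/2)}\sigma_c) < \theta < m^{-1}(\hat{\phi} - z^{(\alpha/2)}\sigma_c)\right].$$ (4.9.6)

Hence, $(m^{-1}(\hat{\phi} - z^{(1-\alpha/2)}\sigma_c),\ m^{-1}(\hat{\phi} - z^{(\alpha/2)}\sigma_c))$ is a $(1 - \alpha)100\%$ confidence interval
for $\theta$. Now suppose $\hat{H}$ is the cdf H with a realization $\hat{\theta}$ substituted in for $\theta$, i.e.,
analogous to the $N(\hat{\theta}, \sigma_{\hat{\theta}}^2)$ distribution above. Suppose $\hat{\theta}^*$ is a random variable with
cdf $\hat{H}$. Let $\hat{\phi} = m(\hat{\theta})$ and $\hat{\phi}^* = m(\hat{\theta}^*)$. We have

$$P\left[\hat{\theta}^* \le m^{-1}(\hat{\phi} - z^{(1-\alpha/2)}\sigma_c)\right] = P\left[\hat{\phi}^* \le \hat{\phi} - z^{(1-\alpha/2)}\sigma_c\right] = P\left[\frac{\hat{\phi}^* - \hat{\phi}}{\sigma_c} \le -z^{(1-\alpha/2)}\right] = \alpha/2,$$

similar to (4.9.3). Therefore, $m^{-1}(\hat{\phi} - z^{(1-\alpha/2)}\sigma_c)$ is the $\frac{\alpha}{2}100$th percentile of the
cdf $\hat{H}$. Likewise, $m^{-1}(\hat{\phi} - z^{(\alpha/2)}\sigma_c)$ is the $(1 - \frac{\alpha}{2})100$th percentile of the cdf $\hat{H}$.
Therefore, in the general case too, the percentiles of the distribution of $\hat{H}$ form the
confidence interval for $\theta$. ■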


4.9.2 Bootstrap Testing Procedures 


Bootstrap procedures can also be used effectively in testing hypotheses. We begin 
by discussing these procedures for two-sample problems, which cover many of the 
nuances of the use of the bootstrap in testing. 
Consider a two-sample location problem; that is, $\mathbf{X}' = (X_1, X_2, \ldots, X_{n_1})$ is
a random sample from a distribution with cdf F(x) and $\mathbf{Y}' = (Y_1, Y_2, \ldots, Y_{n_2})$
is a random sample from a distribution with the cdf $F(x - \Delta)$, where $\Delta \in R$.
The parameter $\Delta$ is the shift in locations between the two samples. Hence $\Delta$ can
be written as the difference in location parameters. In particular, assuming that
the means $\mu_Y$ and $\mu_X$ exist, we have $\Delta = \mu_Y - \mu_X$. We consider the one-sided
hypotheses given by

$$H_0: \Delta = 0 \ \text{ versus } \ H_1: \Delta > 0.$$ (4.9.7)

As our test statistic, we take the difference in sample means, i.e.,

$$V = \overline{Y} - \overline{X}.$$ (4.9.8)

Our decision rule is to reject $H_0$ if $V \ge c$. As is often done in practice, we base
our decision on the p-value of the test. Recall if the samples result in the values
$x_1, x_2, \ldots, x_{n_1}$ and $y_1, y_2, \ldots, y_{n_2}$ with realized sample means $\overline{x}$ and $\overline{y}$, respectively,
then the p-value of the test is

$$\hat{p} = P_{H_0}[V \ge \overline{y} - \overline{x}].$$ (4.9.9)

Our goal is a bootstrap estimate of the p-value. But, unlike the last section,
the bootstraps here have to be performed when $H_0$ is true. An easy way to do this
is to combine the samples into one large sample and then to resample at random
and with replacement the combined sample into two samples, one of size $n_1$ (new
xs) and one of size $n_2$ (new ys). Hence the resampling is performed under one
distribution; i.e., $H_0$ is true. Let B be a positive integer and let $v = \overline{y} - \overline{x}$. Our
bootstrap algorithm is


1. Combine the samples into one sample: $\mathbf{z}' = (\mathbf{x}', \mathbf{y}')$.

2. Set j = 1.

3. While $j \le B$, do steps 3-6.

4. Obtain a random sample with replacement of size $n_1$ from $\mathbf{z}$. Call the sample
$\mathbf{x}^{*\prime} = (x_1^*, x_2^*, \ldots, x_{n_1}^*)$. Compute $\overline{x}_j^*$.

5. Obtain a random sample with replacement of size $n_2$ from $\mathbf{z}$. Call the sample
$\mathbf{y}^{*\prime} = (y_1^*, y_2^*, \ldots, y_{n_2}^*)$. Compute $\overline{y}_j^*$.

6. Compute $v_j^* = \overline{y}_j^* - \overline{x}_j^*$.

7. The bootstrap estimated p-value is given by

$$\hat{p}^* = \frac{\#_{j=1}^{B}\{v_j^* \ge v\}}{B}.$$ (4.9.10)

Note that the theory cited above for the bootstrap confidence intervals covers this 
testing situation also. Hence, this bootstrap p-value is valid. 
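The two-sample bootstrap algorithm above can be sketched in R as follows. The function name boot_two_sample, the simulated normal data, and the seed are assumptions made for illustration; this is not the book's function boottesttwo nor the data of the next example.

```r
# Bootstrap p-value for H0: Delta = 0 versus H1: Delta > 0 based on
# v = ybar - xbar. boot_two_sample is a hypothetical name.
boot_two_sample <- function(x, y, B) {
  v <- mean(y) - mean(x)
  z <- c(x, y)                               # step 1: combined sample
  n1 <- length(x)
  n2 <- length(y)
  vstar <- replicate(B, {
    xstar <- sample(z, n1, replace = TRUE)   # step 4
    ystar <- sample(z, n2, replace = TRUE)   # step 5
    mean(ystar) - mean(xstar)                # step 6
  })
  list(v = v, pvalue = mean(vstar >= v))     # step 7, expression (4.9.10)
}

set.seed(492)                                # arbitrary seed
x <- rnorm(15, mean = 100, sd = 10)          # simulated data for illustration
y <- rnorm(15, mean = 120, sd = 10)
res <- boot_two_sample(x, y, 3000)
print(res)
```

Replacing mean by median in the three places where a sample mean is computed gives a bootstrap test based on the difference in sample medians.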


Example 4.9.2. For illustration, we generated data sets from a contaminated normal
distribution, using the R function rcn. Let W be a random variable with
the contaminated normal distribution (3.4.17) with proportion of contamination
$\epsilon = 0.20$ and $\sigma_c = 4$. Thirty independent observations $W_1, W_2, \ldots, W_{30}$ were
generated from this distribution. Then we let $X_i = 10W_i + 100$ for $1 \le i \le 15$ and
$Y_i = 10W_{i+15} + 120$ for $1 \le i \le 15$. Hence the true shift parameter is $\Delta = 20$. The
actual (rounded) data are


X variates 
94.2 111.3 99.7 116.8 
109.3 106.0 111.9 111.6 


Y variates 
125.5 107.1 98.2 128.6
120.3 118.6 111.8 129.3 


Based on the comparison boxplots below, the scales of the two data sets appear to 
be the same, while the y-variates (Sample 2) appear to be shifted to the right of 
x-variates (Sample 1). 


[Figure: comparison boxplots of Sample 1 (x-variates) and Sample 2 (y-variates).]

There are three outliers in the data sets.

Our test statistic for these data is $v = \overline{y} - \overline{x} = 117.74 - 111.11 = 6.63$. Computing
with the R function boottesttwo, we performed the bootstrap algorithm given
above for B = 3000 bootstrap replications. The bootstrap p-value was $\hat{p}^* = 0.169$.
This means that (0.169)(3000) = 507 of the bootstrap test statistics exceeded the
value of the test statistic. Furthermore, these bootstrap values were generated under
$H_0$. In practice, $H_0$ would generally not be rejected for a p-value this high. In Figure
4.9.2, we display a histogram of the 3000 values of the bootstrap test statistic that
were obtained. The relative area to the right of the value of the test statistic, 6.63,
is approximately equal to $\hat{p}^*$.

Figure 4.9.2: Histogram of the 3000 bootstrap $v^*$s. Locate the value of the test
statistic $v = \overline{y} - \overline{x} = 6.63$ on the horizontal axis. The area (proportional to overall
area) to the right is the p-value of the bootstrap test.


For comparison purposes, we used the two-sample “pooled” t-test discussed in 
Example 4.6.2 to test these hypotheses. As the reader can obtain in Exercise 4.9.8, 
for these data, t = 0.93 with a p-value of 0.18, which is quite close to the bootstrap 
p-value. 


The above test uses the difference in sample means as the test statistic. Certainly 
other test statistics could be used. Exercise 4.9.7 asks the reader to obtain the 
bootstrap test based on the difference in sample medians. Often, as with confidence 
intervals, standardizing the test statistic by a scale estimator improves the bootstrap 
test. 

The bootstrap test described above for the two-sample problem is analogous to
permutation tests. In the permutation test, the test statistic is calculated for all
possible samples of xs and ys drawn without replacement from the combined data.
Often, it is approximated by Monte Carlo methods, in which case it is quite similar
to the bootstrap test except, in the case of the bootstrap, the sampling is done with
replacement; see Exercise 4.9.10. Usually, the permutation tests and the bootstrap
tests give very similar solutions; see Efron and Tibshirani (1993) for discussion.

As our second testing situation, consider a one-sample location problem. Sup- 
pose X1, X9,...,X, is a random sample from a continuous cdf F(x) with finite 
mean jl. Suppose we want to test the hypotheses 


Hy: [6 = po versus Hy: > po, 
where jug is specified. As a test statistic we use X with the decision rule 
Reject Ho in favor of H, if X is too large. 


Let $x_1, x_2, \ldots, x_n$ be the realization of the random sample. We base our decision
on the p-value of the test, namely,

$$\widehat{p} = P_{H_0}[\overline{X} \geq \overline{x}],$$

where $\overline{x}$ is the realized value of the sample average when the sample is drawn. Our
bootstrap test is to obtain a bootstrap estimate of this p-value. At first glance, one
might proceed by bootstrapping the statistic $\overline{X}$. But note that the p-value must be
estimated under $H_0$. To assure that $H_0$ is true, bootstrap the values:

$$z_i = x_i - \overline{x} + \mu_0, \quad i = 1, 2, \ldots, n. \tag{4.9.11}$$

Our bootstrap procedure is to randomly sample with replacement from $z_1, z_2, \ldots, z_n$.
Let $(z^*_{j,1}, \ldots, z^*_{j,n})$ denote, say, the $j$th bootstrap sample. As in expression (4.9.4),
it follows that $E(z^*_{j,i}) = \mu_0$. Hence, using the $z_i$s, the bootstrap resampling is
performed under $H_0$. Denote the test statistic by the sample mean $\overline{z}^*_j$. Then the
bootstrap p-value is

$$\widehat{p}^* = \frac{\#_{j=1}^{B}\{\overline{z}^*_j \geq \overline{x}\}}{B}. \tag{4.9.12}$$


Example 4.9.3. To illustrate the bootstrap test just described, consider the following
data set. We generated $n = 20$ observations $X_i = 10W_i + 100$, where $W_i$
has a contaminated normal distribution with proportion of contamination 20% and
$\sigma_c = 4$. Suppose we are interested in testing

$$H_0: \mu = 90 \text{ versus } H_1: \mu > 90.$$


Because the true mean of $X_i$ is 100, the null hypothesis is false. The data generated
are


119.7 104.1  92.8  85.4 108.6  93.4  67.1  88.4 101.0  97.2
 95.4  77.2 100.0 114.2 150.3 102.3 105.8 107.5   0.9  94.1


The sample mean of these values is $\overline{x} = 95.27$, which exceeds 90, but is it significantly
over 90? As discussed above, we bootstrap the values $z_i = x_i - 95.27 + 90$. The
R function boottestonemean performs this bootstrap test. For the run we did,
it computed the 3000 values $\overline{z}^*_j$, which are displayed in the histogram in Figure
4.9.3. The mean of these 3000 values is 89.96, which is quite close to 90. Of these
3000 values, 563 exceeded $\overline{x} = 95.27$; hence, the p-value of the bootstrap test is
0.188. The fraction of the total area that is to the right of 95.27 in Figure 4.9.3 is
approximately equal to 0.188. Such a high p-value is usually deemed nonsignificant;
hence, the null hypothesis would not be rejected.

Figure 4.9.3: Histogram of the 3000 bootstrap $\overline{z}^*$s discussed in Example 4.9.3.
The bootstrap p-value is the area (relative to the total area) under the histogram
and to the right of the 95.27.

For comparison, the reader is asked to show in Exercise 4.9.12 that the value of
the one-sample t-test is t = 0.84, which has a p-value of 0.20. A test based on the
median is discussed in Exercise 4.9.13. ■
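
The one-sample bootstrap test of Example 4.9.3 can be sketched as follows. This is a minimal re-implementation under stated assumptions, not the source of the text's boottestonemean function: the name boot_test_one_mean and its arguments are hypothetical, and the seed is arbitrary, so the p-value it returns will differ somewhat from the 0.188 reported for the authors' run.

```r
# Sketch of the one-sample bootstrap test; a hypothetical re-implementation,
# not the source of the text's boottestonemean function.
boot_test_one_mean <- function(x, mu0, B = 3000) {
  xbar <- mean(x)
  z <- x - xbar + mu0                       # shifted values, so the null holds
  zbar_star <- replicate(B, mean(sample(z, length(z), replace = TRUE)))
  list(xbar = xbar,
       zbar_star = zbar_star,
       pvalue = mean(zbar_star >= xbar))    # proportion of zbar*_j >= xbar
}

# Data of Example 4.9.3 (n = 20); these values average to 95.27.
x <- c(119.7, 104.1, 92.8, 85.4, 108.6, 93.4, 67.1, 88.4, 101.0, 97.2,
       95.4, 77.2, 100.0, 114.2, 150.3, 102.3, 105.8, 107.5, 0.9, 94.1)

set.seed(4932)                              # arbitrary seed
out <- boot_test_one_mean(x, mu0 = 90)
out$xbar                                    # 95.27
mean(out$zbar_star)                         # should be near 90
out$pvalue                                  # varies with the seed
```

Calling hist(out$zbar_star) gives a histogram of the same kind as Figure 4.9.3.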


EXERCISES 


4.9.1. Consider the sulfur dioxide concentrations data discussed in Example 4.1.3. 
Use the R function percentciboot to obtain a bootstrap 95% confidence interval 
for the true mean concentration. Use 3000 bootstraps and compare it with the 
t-confidence interval for the mean. 


4.9.2. Let $x_1, x_2, \ldots, x_n$ be the values of a random sample. A bootstrap sample,
$\mathbf{x}^{*\prime} = (x_1^*, x_2^*, \ldots, x_n^*)$, is a random sample of $x_1, x_2, \ldots, x_n$ drawn with replacement.


(a) Show that $x_1^*, x_2^*, \ldots, x_n^*$ are iid with common cdf $\widehat{F}_n$, the empirical cdf of
$x_1, x_2, \ldots, x_n$.


(b) Show that $E(x_i^*) = \overline{x}$.

(c) If $n$ is odd, show that median$\{x_i^*\} = x_{((n+1)/2)}$.


(d) Show that $V(x_i^*) = n^{-1} \sum_{i=1}^{n} (x_i - \overline{x})^2$.


4.9.3. Let $X_1, X_2, \ldots, X_n$ be a random sample from a $\Gamma(1, \beta)$ distribution.


(a) Show that the confidence interval $\left(2n\overline{X}/(\chi^2_{2n})^{(1-(\alpha/2))},\; 2n\overline{X}/(\chi^2_{2n})^{(\alpha/2)}\right)$ is an
exact $(1 - \alpha)100$% confidence interval for $\beta$.


(b) Using part (a), show that the 90% confidence interval for the data of Example
4.9.1 is (64.99, 136.69).


4.9.4. Consider the situation discussed in Example 4.9.1. Suppose we want to
estimate the median of $X_i$ using the sample median.


(a) Determine the median for a $\Gamma(1, \beta)$ distribution.


(b) The algorithm for the bootstrap percentile confidence intervals is general 
and hence can be used for the median. Rewrite the R code in the func- 
tion percentciboot.s so that the median is the estimator. Using the sample 
given in the example, obtain a 90% bootstrap percentile confidence interval 
for the median. Did it trap the true median in this case? 


4.9.5. Suppose $X_1, X_2, \ldots, X_n$ is a random sample drawn from a $N(\mu, \sigma^2)$ distribution.
As discussed in Example 4.2.1, the pivot random variable for a confidence
interval is

$$t = \frac{\overline{X} - \mu}{S/\sqrt{n}}, \tag{4.9.13}$$

where $\overline{X}$ and $S$ are the sample mean and standard deviation, respectively. Recall
by Theorem 3.6.1 that $t$ has a Student t-distribution with $n - 1$ degrees of freedom;
hence, its distribution is free of all parameters for this normal situation. In the
notation of this section, $t^{(\gamma)}_{n-1}$ denotes the $\gamma 100$% percentile of a t-distribution with
$n - 1$ degrees of freedom. Using this notation, show that a $(1 - \alpha)100$% confidence
interval for $\mu$ is

$$\left(\overline{x} - t^{(1-\alpha/2)}\frac{s}{\sqrt{n}},\; \overline{x} - t^{(\alpha/2)}\frac{s}{\sqrt{n}}\right). \tag{4.9.14}$$


4.9.6. Frequently, the bootstrap percentile confidence interval can be improved if
the estimator $\widehat{\theta}$ is standardized by an estimate of scale. To illustrate this, consider a
bootstrap for a confidence interval for the mean. Let $x_1^*, x_2^*, \ldots, x_n^*$ be a bootstrap
sample drawn from the sample $x_1, x_2, \ldots, x_n$. Consider the bootstrap pivot [analog
of (4.9.13)]:

$$t^* = \frac{\overline{x}^* - \overline{x}}{s^*/\sqrt{n}}, \tag{4.9.15}$$

where $\overline{x}^*$ and $s^*$ are the sample mean and sample standard deviation of the bootstrap sample.


(a) Rewrite the percentile bootstrap confidence interval algorithm using the mean
and collecting $t_j^*$ for $j = 1, 2, \ldots, B$. Form the interval

$$\left(\overline{x} - t^{*(1-\alpha/2)}\frac{s}{\sqrt{n}},\; \overline{x} - t^{*(\alpha/2)}\frac{s}{\sqrt{n}}\right), \tag{4.9.16}$$

where $t^{*(\gamma)} = t^*_{([\gamma B])}$; that is, order the $t_j^*$s and pick off the quantiles.


(b) Rewrite the R program percentciboot.s and then use it to find a 90% confidence
interval for $\mu$ for the data in Example 4.9.3. Use 3000 bootstraps.


(c) Compare your confidence interval in the last part with the nonstandardized
bootstrap confidence interval based on the program percentciboot.s.


4.9.7. Consider the algorithm for a two-sample bootstrap test given in Section 
4.9.2. 


(a) Rewrite the algorithm for the bootstrap test based on the difference in medi- 
ans. 


(b) Consider the data in Example 4.9.2. By substituting the difference in medians 
for the difference in means in the R program boottesttwo.s, obtain the 
bootstrap test for the algorithm of part (a). 


(c) Obtain the estimated p-value of your test for B = 3000 and compare it to the 
estimated p-value of 0.063 that the authors obtained. 


4.9.8. Consider the data of Example 4.9.2. The two-sample t-test of Example 4.6.2 
can be used to test these hypotheses. The test is not exact here (why?), but it is 
an approximate test. Show that the value of the test statistic is t = 0.93, with an 
approximate p-value of 0.18. 


4.9.9. In Example 4.9.3, suppose we are testing the two-sided hypotheses, 
$H_0: \mu = 90$ versus $H_1: \mu \neq 90$.
(a) Determine the bootstrap p-value for this situation. 
(b) Rewrite the R program boottestonemean to obtain this p-value. 
(c) Compute the p-value based on 3000 bootstraps. 


4.9.10. Consider the following permutation test for the two-sample problem with 
hypotheses (4.9.7). Let $\mathbf{x}' = (x_1, x_2, \ldots, x_{n_1})$ and $\mathbf{y}' = (y_1, y_2, \ldots, y_{n_2})$ be the
realizations of the two random samples. The test statistic is the difference in sample
means $\overline{y} - \overline{x}$. The estimated p-value of the test is calculated as follows:


1. Combine the data into one sample $\mathbf{z}' = (\mathbf{x}', \mathbf{y}')$.


2. Obtain all possible samples of size $n_1$ drawn without replacement from $\mathbf{z}$. Each
such sample automatically gives another sample of size $n_2$, i.e., all elements
of $\mathbf{z}$ not in the sample of size $n_1$. There are $M = \binom{n_1 + n_2}{n_1}$ such samples.

3. For each such sample $j$:


(a) Label the sample of size $n_1$ by $\mathbf{x}^*$ and label the sample of size $n_2$ by $\mathbf{y}^*$.
(b) Calculate $v_j^* = \overline{y}^* - \overline{x}^*$.


4. The estimated p-value is $\widehat{p}^* = \#\{v_j^* \geq \overline{y} - \overline{x}\}/M$.


(a) Suppose we have two samples each of size 3 which result in the realizations: 
x’ = (10,15,21) and y’ = (20,25,30). Determine the test statistic and the 
permutation test described above along with the p-value. 


(b) If we ignore distinct samples, then we can approximate the permutation test 
by using the bootstrap algorithm with resampling performed at random and 
without replacement. Modify the bootstrap program boottesttwo.s to do 
this and obtain this approximate permutation test based on 3000 resamples 
for the data of Example 4.9.2. 


(c) In general, what is the probability of having distinct samples in the approx- 
imate permutation test described in the last part? Assume that the original 
data are distinct values. 


4.9.11. Let $z^*$ be drawn at random from the discrete distribution that has mass
$n^{-1}$ at each point $z_i = x_i - \overline{x} + \mu_0$, where $(x_1, x_2, \ldots, x_n)$ is the realization of a
random sample. Determine $E(z^*)$ and $V(z^*)$.


4.9.12. For the situation described in Example 4.9.3, show that the value of the 
one-sample t-test is t = 0.84 and its associated p-value is 0.20. 


4.9.13. For the situation described in Example 4.9.3, obtain the bootstrap test 
based on medians. Use the same hypotheses; i.e., 


$H_0: \mu = 90$ versus $H_1: \mu > 90$.


4.9.14. Consider Darwin’s experiment on Zea mays discussed in Examples 4.5.1
and 4.5.5.


(a) Obtain a bootstrap test for this experimental data. Keep in mind that the 
data are recorded in pairs. Hence your resampling procedure must keep this 
dependence intact and still be under Ho. 


(b) Write an R program that executes your bootstrap test and compare its p-value 
with that found in Example 4.5.5. 


4.10 *Tolerance Limits for Distributions 


We propose now to investigate a problem that has something of the same flavor
as that treated in Section 4.4. Specifically, can we compute the probability that a
certain random interval includes (or covers) a preassigned percentage of the probability
of the distribution under consideration? And, by appropriate selection of
the random interval, can we be led to an additional distribution-free method of
statistical inference?

Let $X$ be a random variable with distribution function $F(x)$ of the continuous
type. Let $Z = F(X)$. Then, as shown in Exercise 4.8.1, $Z$ has a uniform(0, 1)
distribution. That is, $Z = F(X)$ has the pdf

$$h(z) = \begin{cases} 1 & 0 < z < 1 \\ 0 & \text{elsewhere.} \end{cases}$$

Then, if $0 < p < 1$, we have

$$P[F(X) \leq p] = \int_0^p dz = p.$$

Now $F(x) = P(X \leq x)$. Since $P(X = x) = 0$, then $F(x)$ is the fractional part of
the probability for the distribution of $X$ that is between $-\infty$ and $x$. If $F(x) \leq p$,
then no more than $100p$% of the probability for the distribution of $X$ is between
$-\infty$ and $x$. But recall $P[F(X) \leq p] = p$. That is, the probability that the random
variable $Z = F(X)$ is less than or equal to $p$ is precisely the probability that the
random interval $(-\infty, X)$ contains no more than $100p$% of the probability for the
distribution. For example, if $p = 0.70$, the probability that the random interval
$(-\infty, X)$ contains no more than 70% of the probability for the distribution is 0.70;
and the probability that the random interval $(-\infty, X)$ contains more than 70% of
the probability for the distribution is $1 - 0.70 = 0.30$.
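
A quick simulation can make the statement $P[F(X) \leq p] = p$ concrete. The sketch below assumes, purely for illustration, that $X$ has a standard exponential distribution; any continuous distribution would serve, and the sample size and seed are arbitrary.

```r
# Sketch: check P[F(X) <= p] = p by simulation for one continuous model.
# The exponential model and the sample size are arbitrary choices.
set.seed(4101)
x <- rexp(10000, rate = 1)      # X has cdf F(x) = 1 - exp(-x), x > 0
z <- pexp(x, rate = 1)          # Z = F(X)
p <- 0.70
prop_le_p <- mean(z <= p)       # estimates P[F(X) <= 0.70]
prop_gt_p <- mean(z > p)        # estimates P[F(X) > 0.70]
c(prop_le_p, prop_gt_p)
```

The two proportions should be close to 0.70 and 0.30, the values given in the paragraph above.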

We now consider certain functions of the order statistics. Let $X_1, X_2, \ldots, X_n$
denote a random sample of size $n$ from a distribution that has a positive and continuous
pdf $f(x)$ if and only if $a < x < b$, and let $F(x)$ denote the associated distribution
function. Consider the random variables $F(X_1), F(X_2), \ldots, F(X_n)$. These
random variables are independent and each, in accordance with Exercise 4.8.1, has
a uniform distribution on the interval (0,1). Thus, $F(X_1), F(X_2), \ldots, F(X_n)$ is a
random sample of size $n$ from a uniform distribution on the interval (0,1). Consider
the order statistics of this random sample $F(X_1), F(X_2), \ldots, F(X_n)$. Let $Z_1$ be the
smallest of these $F(X_i)$, $Z_2$ the next $F(X_i)$ in order of magnitude, ..., and $Z_n$
the largest of $F(X_i)$. If $Y_1, Y_2, \ldots, Y_n$ are the order statistics of the initial random
sample $X_1, X_2, \ldots, X_n$, the fact that $F(x)$ is a nondecreasing (here, strictly increasing)
function of $x$ implies that $Z_1 = F(Y_1), Z_2 = F(Y_2), \ldots, Z_n = F(Y_n)$. Hence, it
follows from (4.4.1) that the joint pdf of $Z_1, Z_2, \ldots, Z_n$ is given by

$$h(z_1, z_2, \ldots, z_n) = \begin{cases} n! & 0 < z_1 < z_2 < \cdots < z_n < 1 \\ 0 & \text{elsewhere.} \end{cases} \tag{4.10.1}$$

This proves a special case of the following theorem.


Theorem 4.10.1. Let $Y_1, Y_2, \ldots, Y_n$ denote the order statistics of a random sample
of size $n$ from a distribution of the continuous type that has pdf $f(x)$ and cdf $F(x)$.
The joint pdf of the random variables $Z_i = F(Y_i)$, $i = 1, 2, \ldots, n$, is given by
expression (4.10.1).

Because the distribution function of $Z = F(X)$ is given by $z$, $0 < z < 1$, it
follows from (4.4.2) that the marginal pdf of $Z_k = F(Y_k)$ is the following beta pdf:

$$h_k(z_k) = \begin{cases} \dfrac{n!}{(k-1)!\,(n-k)!}\, z_k^{k-1}(1 - z_k)^{n-k} & 0 < z_k < 1 \\ 0 & \text{elsewhere.} \end{cases} \tag{4.10.2}$$

Moreover, from (4.4.3), the joint pdf of $Z_i = F(Y_i)$ and $Z_j = F(Y_j)$ is, with $i < j$,
given by

$$h(z_i, z_j) = \begin{cases} \dfrac{n!\, z_i^{i-1}(z_j - z_i)^{j-i-1}(1 - z_j)^{n-j}}{(i-1)!\,(j-i-1)!\,(n-j)!} & 0 < z_i < z_j < 1 \\ 0 & \text{elsewhere.} \end{cases} \tag{4.10.3}$$

Consider the difference $Z_j - Z_i = F(Y_j) - F(Y_i)$, $i < j$. Now $F(y_j) = P(X \leq y_j)$
and $F(y_i) = P(X \leq y_i)$. Since $P(X = y_i) = P(X = y_j) = 0$, then the difference
$F(y_j) - F(y_i)$ is that fractional part of the probability for the distribution of $X$ that
is between $y_i$ and $y_j$. Let $p$ denote a positive proper fraction. If $F(y_j) - F(y_i) \geq p$,
then at least $100p$% of the probability for the distribution of $X$ is between $y_i$ and
$y_j$. Let it be given that $\gamma = P[F(Y_j) - F(Y_i) \geq p]$. Then the random interval
$(Y_i, Y_j)$ has probability $\gamma$ of containing at least $100p$% of the probability for the
distribution of $X$. Now if $y_i$ and $y_j$ denote, respectively, observational values of $Y_i$
and $Y_j$, the interval $(y_i, y_j)$ either does or does not contain at least $100p$% of the
probability for the distribution of $X$. However, we refer to the interval $(y_i, y_j)$ as
a $100\gamma$% tolerance interval for $100p$% of the probability for the distribution of
$X$. In like vein, $y_i$ and $y_j$ are called the $100\gamma$% tolerance limits for $100p$% of the
probability for the distribution of $X$.

One way to compute the probability $\gamma = P[F(Y_j) - F(Y_i) \geq p]$ is to use equation
(4.10.3), which gives the joint pdf of $Z_i = F(Y_i)$ and $Z_j = F(Y_j)$. The required
probability is then given by

$$\gamma = \int_0^{1-p} \left[ \int_{p + z_i}^{1} h(z_i, z_j)\, dz_j \right] dz_i.$$


Sometimes, this is a rather tedious computation. For this reason and also for the 
reason that coverages are important in distribution-free statistical inference, we 
choose to introduce at this time the concept of coverage. 

Consider the random variables $W_1 = F(Y_1) = Z_1$, $W_2 = F(Y_2) - F(Y_1) =
Z_2 - Z_1$, $W_3 = F(Y_3) - F(Y_2) = Z_3 - Z_2$, ..., $W_n = F(Y_n) - F(Y_{n-1}) =
Z_n - Z_{n-1}$. The random variable $W_1$ is called a coverage of the random interval
$\{x : -\infty < x < Y_1\}$ and the random variable $W_i$, $i = 2, 3, \ldots, n$, is called a coverage
of the random interval $\{x : Y_{i-1} < x < Y_i\}$. We now find the joint pdf of the $n$
coverages $W_1, W_2, \ldots, W_n$. First we note that the inverse functions of the associated
transformation are given by

$$z_i = \sum_{j=1}^{i} w_j, \quad \text{for } i = 1, 2, \ldots, n.$$

We also note that the Jacobian is equal to 1 and that the space of positive probability
density is

$$\{(w_1, w_2, \ldots, w_n) : 0 < w_i,\ i = 1, 2, \ldots, n,\ w_1 + \cdots + w_n < 1\}.$$

Since the joint pdf of $Z_1, Z_2, \ldots, Z_n$ is $n!$, $0 < z_1 < z_2 < \cdots < z_n < 1$, zero
elsewhere, the joint pdf of the $n$ coverages is

$$k(w_1, \ldots, w_n) = \begin{cases} n! & 0 < w_i,\ i = 1, \ldots, n,\ w_1 + \cdots + w_n < 1 \\ 0 & \text{elsewhere.} \end{cases}$$

Because the pdf $k(w_1, \ldots, w_n)$ is symmetric in $w_1, w_2, \ldots, w_n$, it is evident that the
distribution of every sum of $r$, $r < n$, of these coverages $W_1, \ldots, W_n$ is exactly the
same for each fixed value of $r$. For instance, if $i < j$ and $r = j - i$, the distribution
of $Z_j - Z_i = F(Y_j) - F(Y_i) = W_{i+1} + W_{i+2} + \cdots + W_j$ is exactly the same as that
of $Z_{j-i} = F(Y_{j-i}) = W_1 + W_2 + \cdots + W_{j-i}$. But we know that the pdf of $Z_{j-i}$ is
the beta pdf of the form

$$h_{j-i}(v) = \begin{cases} \dfrac{\Gamma(n+1)}{\Gamma(j-i)\,\Gamma(n-j+i+1)}\, v^{j-i-1}(1 - v)^{n-j+i} & 0 < v < 1 \\ 0 & \text{elsewhere.} \end{cases}$$

Consequently, $F(Y_j) - F(Y_i)$ has this pdf and

$$P[F(Y_j) - F(Y_i) \geq p] = \int_p^1 h_{j-i}(v)\, dv.$$

Example 4.10.1. Let $Y_1 < Y_2 < \cdots < Y_6$ be the order statistics of a random
sample of size 6 from a distribution of the continuous type. We want to use the
observed interval $(y_1, y_6)$ as a tolerance interval for 80% of the distribution. Then

$$\gamma = P[F(Y_6) - F(Y_1) \geq 0.8] = 1 - \int_0^{0.8} 30v^4(1 - v)\, dv,$$

because the integrand is the pdf of $F(Y_6) - F(Y_1)$. Accordingly,

$$\gamma = 1 - 6(0.8)^5 + 5(0.8)^6 = 0.34,$$

approximately. That is, the observed values of $Y_1$ and $Y_6$ define a 34% tolerance
interval for 80% of the probability for the distribution.
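
The computation in Example 4.10.1 can be reproduced with the beta cdf. In the sketch below, tol_prob is a hypothetical helper name introduced only for illustration; it evaluates $P[F(Y_j) - F(Y_i) \geq p]$ as the upper tail of a beta distribution with parameters $j - i$ and $n - j + i + 1$.

```r
# Sketch: gamma = P[F(Y_j) - F(Y_i) >= p] via the beta(j - i, n - j + i + 1) cdf.
# tol_prob is a hypothetical helper name, not a function from the text.
tol_prob <- function(n, i, j, p) {
  pbeta(p, shape1 = j - i, shape2 = n - j + i + 1, lower.tail = FALSE)
}
gamma_exact <- tol_prob(n = 6, i = 1, j = 6, p = 0.8)
gamma_closed <- 1 - 6 * 0.8^5 + 5 * 0.8^6   # closed form from Example 4.10.1
c(gamma_exact, gamma_closed)                 # both about 0.34
```

The same helper applies to other choices of $n$, $i$, $j$, and $p$, which avoids the double integral based on (4.10.3).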


Remark 4.10.1. Tolerance intervals are extremely important and often they are 
more desirable than confidence intervals. For illustration, consider a “fill” problem 
in which a manufacturer says that each container has at least 12 ounces of the 
product. Let X be the amount in a container. The company would be pleased to 
note that the interval (12.1, 12.3), for instance, is a 95% tolerance interval for 99% 
of the distribution of X. This would be true in this case, because the FDA allows 
a very small fraction of the containers to be less than 12 ounces. ■

EXERCISES


4.10.1. Let $Y_1$ and $Y_n$ be, respectively, the first and the $n$th order statistic of a
random sample of size $n$ from a distribution of the continuous type having cdf $F(x)$.
Find the smallest value of $n$ such that $P[F(Y_n) - F(Y_1) \geq 0.5]$ is at least 0.95.


4.10.2. Let $Y_2$ and $Y_{n-1}$ denote the second and the $(n - 1)$st order statistics of
a random sample of size $n$ from a distribution of the continuous type having a
distribution function $F(x)$. Compute $P[F(Y_{n-1}) - F(Y_2) \geq p]$, where $0 < p < 1$.


4.10.3. Let $Y_1 < Y_2 < \cdots < Y_{48}$ be the order statistics of a random sample of size
48 from a distribution of the continuous type. We want to use the observed interval
$(y_4, y_{45})$ as a $100\gamma$% tolerance interval for 75% of the distribution.


(a) What is the value of $\gamma$?


(b) Approximate the integral in part (a) by noting that it can be written as a par- 
tial sum of a binomial pdf, which in turn can be approximated by probabilities 
associated with a normal distribution (see Section 5.3). 


4.10.4. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample of size
$n$ from a distribution of the continuous type having distribution function $F(x)$.


(a) What is the distribution of $U = 1 - F(Y_j)$?


(b) Determine the distribution of $V = F(Y_n) - F(Y_j) + F(Y_i) - F(Y_1)$, where
$i < j$.


4.10.5. Let $Y_1 < Y_2 < \cdots < Y_{10}$ be the order statistics of a random sample from
a continuous-type distribution with distribution function $F(x)$. What is the joint
distribution of $V_1 = F(Y_4) - F(Y_2)$ and $V_2 = F(Y_{10}) - F(Y_6)$?
Chapter 5


Consistency and Limiting 
Distributions 


In Chapter 4, we introduced some of the main concepts in statistical inference, 
namely, point estimation, confidence intervals, and hypothesis tests. For readers 
who on first reading have skipped Chapter 4, we review these ideas in Section 5.1.1. 

The theory behind these inference procedures often depends on the distribution
of a pivot random variable. For example, suppose $X_1, X_2, \ldots, X_n$ is a random
sample on a random variable $X$ which has a $N(\mu, \sigma^2)$ distribution. Denote the
sample mean by $\overline{X}_n = n^{-1} \sum_{i=1}^{n} X_i$. Then the pivot random variable of interest is

$$Z_n = \frac{\overline{X}_n - \mu}{\sigma/\sqrt{n}}.$$

This random variable plays a key role in obtaining exact procedures for the confidence
interval for $\mu$ and for tests of hypotheses concerning $\mu$. What if $X$ does
not have a normal distribution? In this case, in Chapter 4, we discussed inference
procedures, which were quite similar to the exact procedures, but they were based
on the “approximate” (as the sample size $n$ gets large) distribution of $Z_n$.

There are several types of convergence used in statistics, and in this chapter we 
discuss two of the most important: convergence in probability and convergence in 
distribution. These concepts provide structure to the “approximations” discussed 
in Chapter 4. Beyond this, though, these concepts play a crucial role in much of 
statistics and probability. We begin with convergence in probability. 


5.1 Convergence in Probability 
In this section, we formalize a way of saying that a sequence of random variables
$\{X_n\}$ is getting “close” to another random variable $X$, as $n \to \infty$. We will use this
concept throughout the book.

Definition 5.1.1. Let $\{X_n\}$ be a sequence of random variables and let $X$ be a random
variable defined on a sample space. We say that $X_n$ converges in probability
to $X$ if, for all $\epsilon > 0$,

$$\lim_{n \to \infty} P[|X_n - X| \geq \epsilon] = 0,$$

or equivalently,

$$\lim_{n \to \infty} P[|X_n - X| < \epsilon] = 1.$$

If so, we write

$$X_n \xrightarrow{P} X.$$

If $X_n \xrightarrow{P} X$, we often say that the mass of the difference $X_n - X$ is converging
to 0. In statistics, often the limiting random variable $X$ is a constant; i.e., $X$ is a
degenerate random variable with all its mass at some constant $a$. In this case, we
write $X_n \xrightarrow{P} a$. Also, as Exercise 5.1.1 shows, for a sequence of real numbers $\{a_n\}$,
$a_n \to a$ is equivalent to $a_n \xrightarrow{P} a$.

One way of showing convergence in probability is to use Chebyshev’s Theorem
(1.10.3). An illustration of this is given in the following proof. To emphasize the fact
that we are working with sequences of random variables, we may place a subscript
$n$ on the appropriate random variables; for example, write $\overline{X}$ as $\overline{X}_n$.


Theorem 5.1.1 (Weak Law of Large Numbers). Let $\{X_n\}$ be a sequence of iid
random variables having common mean $\mu$ and variance $\sigma^2 < \infty$. Let $\overline{X}_n =
n^{-1} \sum_{i=1}^{n} X_i$. Then

$$\overline{X}_n \xrightarrow{P} \mu.$$

Proof. From expression (2.8.6) of Example 2.8.1, the mean and variance of $\overline{X}_n$ are
$\mu$ and $\sigma^2/n$, respectively. Hence, by Chebyshev’s Theorem, we have for every $\epsilon > 0$,

$$P[|\overline{X}_n - \mu| \geq \epsilon] = P[|\overline{X}_n - \mu| \geq (\epsilon\sqrt{n}/\sigma)(\sigma/\sqrt{n})] \leq \frac{\sigma^2}{n\epsilon^2} \to 0. \quad ■$$


This theorem says that all the mass of the distribution of $\overline{X}_n$ is converging to $\mu$,
as $n \to \infty$. In a sense, for $n$ large, $\overline{X}_n$ is close to $\mu$. But how close? For instance, if
we were to estimate $\mu$ by $\overline{X}_n$, what can we say about the error of estimation? We
answer this in Section 5.3.
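
A small simulation can illustrate both the Weak Law and the Chebyshev bound used in its proof. The sketch below assumes, for illustration only, a $N(0,1)$ population, $\epsilon = 0.2$, and an arbitrary seed and number of replications; it estimates $P[|\overline{X}_n - \mu| \geq \epsilon]$ for several $n$ and sets the estimates beside the bound $\sigma^2/(n\epsilon^2)$.

```r
# Sketch: estimate P[|Xbar_n - mu| >= eps] and compare with the Chebyshev
# bound sigma^2 / (n * eps^2). The N(0,1) model, eps, and nsim are arbitrary.
set.seed(5101)
mu <- 0
sigma <- 1
eps <- 0.2
nsim <- 2000
ns <- c(10, 100, 1000)
est <- sapply(ns, function(n) {
  xbar <- replicate(nsim, mean(rnorm(n, mu, sigma)))
  mean(abs(xbar - mu) >= eps)
})
bound <- sigma^2 / (ns * eps^2)   # a bound above 1 carries no information
cbind(n = ns, estimate = est, chebyshev_bound = bound)
```

The estimated probabilities fall toward 0 as $n$ grows and stay below the bound, which is typically far from sharp; the bound is used in the proof only because it tends to 0.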

Actually, in a more advanced course, a Strong Law of Large Numbers is proved;
see page 124 of Chung (1974). One result of this theorem is that we can weaken the
hypothesis of Theorem 5.1.1 to the assumption that the random variables $X_i$ are
independent and each has finite mean $\mu$. Thus the Strong Law of Large Numbers
is a first moment theorem, while the Weak Law requires the existence of the second
moment.

There are several theorems concerning convergence in probability which will 
be useful in the sequel. Together the next two theorems say that convergence in 
probability is closed under linearity. 


Theorem 5.1.2. Suppose $X_n \xrightarrow{P} X$ and $Y_n \xrightarrow{P} Y$. Then $X_n + Y_n \xrightarrow{P} X + Y$.

Proof: Let $\epsilon > 0$ be given. Using the triangle inequality, we can write

$$|X_n - X| + |Y_n - Y| \geq |(X_n + Y_n) - (X + Y)| \geq \epsilon.$$

Since $P$ is monotone relative to set containment, we have

$$\begin{aligned} P[|(X_n + Y_n) - (X + Y)| \geq \epsilon] &\leq P[|X_n - X| + |Y_n - Y| \geq \epsilon] \\ &\leq P[|X_n - X| \geq \epsilon/2] + P[|Y_n - Y| \geq \epsilon/2]. \end{aligned}$$

By the hypothesis of the theorem, the last two terms converge to 0 as $n \to \infty$,
which gives us the desired result. ■


Theorem 5.1.3. Suppose $X_n \xrightarrow{P} X$ and $a$ is a constant. Then $aX_n \xrightarrow{P} aX$.


Proof: If $a = 0$, the result is immediate. Suppose $a \neq 0$. Let $\epsilon > 0$. The result
follows from these equalities:

$$P[|aX_n - aX| \geq \epsilon] = P[|a||X_n - X| \geq \epsilon] = P[|X_n - X| \geq \epsilon/|a|],$$

and by hypotheses the last term goes to 0 as $n \to \infty$. ■


Theorem 5.1.4. Suppose $X_n \xrightarrow{P} a$ and the real function $g$ is continuous at $a$. Then
$g(X_n) \xrightarrow{P} g(a)$.


Proof: Let $\epsilon > 0$. Then since $g$ is continuous at $a$, there exists a $\delta > 0$ such that if
$|x - a| < \delta$, then $|g(x) - g(a)| < \epsilon$. Thus

$$|g(x) - g(a)| \geq \epsilon \;\Rightarrow\; |x - a| \geq \delta.$$

Substituting $X_n$ for $x$ in the above implication, we obtain

$$P[|g(X_n) - g(a)| \geq \epsilon] \leq P[|X_n - a| \geq \delta].$$

By the hypothesis, the last term goes to 0 as $n \to \infty$, which gives us the result. ■


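This theorem gives us many useful results. For instance, if $X_n \xrightarrow{P} a$, then

$$\begin{aligned} X_n^2 &\xrightarrow{P} a^2 \\ 1/X_n &\xrightarrow{P} 1/a, \quad \text{provided } a \neq 0 \\ \sqrt{X_n} &\xrightarrow{P} \sqrt{a}, \quad \text{provided } a \geq 0. \end{aligned}$$

Actually, in a more advanced class, it is shown that if $X_n \xrightarrow{P} X$ and $g$ is a
continuous function, then $g(X_n) \xrightarrow{P} g(X)$; see page 104 of Tucker (1967). We make
use of this in the next theorem.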


Theorem 5.1.5. Suppose $X_n \xrightarrow{P} X$ and $Y_n \xrightarrow{P} Y$. Then $X_nY_n \xrightarrow{P} XY$.

Proof: Using the above results, we have

$$\begin{aligned} X_nY_n &= \frac{1}{2}X_n^2 + \frac{1}{2}Y_n^2 - \frac{1}{2}(X_n - Y_n)^2 \\ &\xrightarrow{P} \frac{1}{2}X^2 + \frac{1}{2}Y^2 - \frac{1}{2}(X - Y)^2 = XY. \quad ■ \end{aligned}$$


5.1.1 Sampling and Statistics 


Consider the situation where we have a random variable $X$ whose pdf (or pmf) is
written as $f(x; \theta)$ for an unknown parameter $\theta \in \Omega$. For example, the distribution
of $X$ is normal with unknown mean $\mu$ and variance $\sigma^2$. Then $\theta = (\mu, \sigma^2)$ and
$\Omega = \{\theta = (\mu, \sigma^2) : -\infty < \mu < \infty, \sigma > 0\}$. As another example, the distribution
of $X$ is $\Gamma(1, \beta)$, where $\beta > 0$ is unknown. Our information consists of a random
sample $X_1, X_2, \ldots, X_n$ on $X$; i.e., $X_1, X_2, \ldots, X_n$ are independent and identically
distributed (iid) random variables with the common pdf $f(x; \theta)$, $\theta \in \Omega$. We say
that $T$ is a statistic if $T$ is a function of the sample; i.e., $T = T(X_1, X_2, \ldots, X_n)$.
Here, we want to consider $T$ as a point estimator of $\theta$. For example, if $\mu$ is
the unknown mean of $X$, then we may use as our point estimator the sample mean
$\overline{X} = n^{-1} \sum_{i=1}^{n} X_i$. When the sample is drawn let $x_1, x_2, \ldots, x_n$ denote the observed
values of $X_1, X_2, \ldots, X_n$. We call these values the realized values of the sample
and call the realized statistic $t = t(x_1, x_2, \ldots, x_n)$ a point estimate of $\theta$.

In Chapters 6 and 7, we discuss properties of point estimators in formal settings.
For now, we consider two properties: unbiasedness and consistency. We say
that the point estimator $T$ for $\theta$ is unbiased if $E(T) = \theta$. Recall in Section
2.8, we showed that the sample mean $\overline{X}$ and the sample variance $S^2$ are unbiased
estimators of $\mu$ and $\sigma^2$ respectively; see equations (2.8.6) and (2.8.8). We next
consider consistency of a point estimator.


Definition 5.1.2 (Consistency). Let $X$ be a random variable with cdf $F(x, \theta)$,
$\theta \in \Omega$. Let $X_1, \ldots, X_n$ be a sample from the distribution of $X$ and let $T_n$ denote a
statistic. We say $T_n$ is a consistent estimator of $\theta$ if

$$T_n \xrightarrow{P} \theta.$$

If $X_1, \ldots, X_n$ is a random sample from a distribution with finite mean $\mu$ and
variance $\sigma^2$, then by the Weak Law of Large Numbers, the sample mean, $\overline{X}_n$, is a
consistent estimator of $\mu$.

Figure 5.1.1 displays realizations of the sample mean for samples of size 10 to
2000 in steps of 10 which are drawn from a N(0,1) distribution. The lines on the
plot encompass the interval $\mu \pm 0.04$ for $\mu = 0$. As $n$ increases, the realizations
tend to stay within this interval, illustrating the consistency of the sample mean. The
R function consistmean produces this plot. Within this function, if the function
mean is changed to median a similar plot on the estimator med $X_i$ can be obtained.
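
A plot of the kind shown in Figure 5.1.1 can be sketched as follows. The text's consistmean code is not reproduced here; the function name consist_sketch, its arguments, and the choice to draw a fresh independent sample at each sample size are all assumptions made for illustration.

```r
# Sketch of a function in the spirit of consistmean; the text's own code is
# not reproduced here, so the name and arguments below are assumptions.
consist_sketch <- function(est = mean, ns = seq(10, 2000, by = 10),
                           mu = 0, half_width = 0.04,
                           do_plot = interactive()) {
  # One independent N(0,1) sample per sample size, reduced by the estimator.
  realized <- sapply(ns, function(n) est(rnorm(n)))
  if (do_plot) {
    plot(ns, realized, xlab = "n", ylab = "estimate")
    abline(h = mu + c(-1, 1) * half_width)   # lines at mu - 0.04 and mu + 0.04
  }
  data.frame(n = ns, estimate = realized)
}

set.seed(5111)                               # arbitrary seed
out_mean <- consist_sketch(mean)
out_median <- consist_sketch(median)
# Fraction of realizations inside the band, smaller n versus larger n.
inside <- abs(out_mean$estimate) <= 0.04
c(first_half = mean(inside[1:100]), second_half = mean(inside[101:200]))
```

Passing median instead of mean gives the corresponding plot for the sample median, as described above; the plot is drawn only in an interactive session unless do_plot = TRUE is supplied.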


Figure 5.1.1: Realizations of the point estimator $\overline{X}$ for samples of size 10 to 2000
in steps of 10 which are drawn from a N(0,1) distribution.


Example 5.1.1 (Sample Variance). Let $X_1, \ldots, X_n$ denote a random sample from
a distribution with mean $\mu$ and variance $\sigma^2$. In Example 2.8.7, we showed that the
sample variance is an unbiased estimator of $\sigma^2$. We now show that it is a consistent
estimator of $\sigma^2$. Recall Theorem 5.1.1 which shows that $\overline{X}_n \xrightarrow{P} \mu$. To show that the
sample variance converges in probability to $\sigma^2$, assume further that $E[X_1^4] < \infty$, so
that Var$(S^2) < \infty$. Using the preceding results, we can show the following:

$$\begin{aligned} S_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \overline{X}_n)^2 &= \frac{n}{n-1} \left( \frac{1}{n} \sum_{i=1}^{n} X_i^2 - \overline{X}_n^2 \right) \\ &\xrightarrow{P} 1 \cdot [E(X_1^2) - \mu^2] = \sigma^2. \end{aligned}$$

Hence the sample variance is a consistent estimator of $\sigma^2$. From the discussion
above, we have immediately that $S_n \xrightarrow{P} \sigma$; that is, the sample standard deviation is
a consistent estimator of the population standard deviation. ■


Unlike the last example, sometimes we can obtain the convergence by using the 
distribution function. We illustrate this with the following example: 


Example 5.1.2 (Maximum of a Sample from a Uniform Distribution). Suppose
$X_1, \ldots, X_n$ is a random sample from a uniform$(0, \theta)$ distribution. Suppose $\theta$ is
unknown. An intuitive estimate of $\theta$ is the maximum of the sample. Let $Y_n =
\max\{X_1, \ldots, X_n\}$. Exercise 5.1.4 shows that the cdf of $Y_n$ is

$$F_{Y_n}(t) = \begin{cases} 1 & t > \theta \\ \left(\dfrac{t}{\theta}\right)^n & 0 < t \leq \theta \\ 0 & t \leq 0. \end{cases} \tag{5.1.1}$$

Hence the pdf of $Y_n$ is

$$f_{Y_n}(t) = \begin{cases} \dfrac{n}{\theta^n}\, t^{n-1} & 0 < t < \theta \\ 0 & \text{elsewhere.} \end{cases} \tag{5.1.2}$$

Based on its pdf, it is easy to show that $E(Y_n) = (n/(n+1))\theta$. Thus, $Y_n$ is a biased
estimator of $\theta$. Note, however, that $((n+1)/n)Y_n$ is an unbiased estimator of $\theta$.
Further, based on the cdf of $Y_n$, it is easily seen that $Y_n \xrightarrow{P} \theta$ and, hence, that the
sample maximum is a consistent estimate of $\theta$. Note that the unbiased estimator,
$((n+1)/n)Y_n$, is also consistent. ■


To expand on Example 5.1.2, by the Weak Law of Large Numbers, Theorem
5.1.1, it follows that $\overline{X}_n$ is a consistent estimator of $\theta/2$, so $2\overline{X}_n$ is a consistent
estimator of $\theta$. Note the difference in how we showed that $Y_n$ and $2\overline{X}_n$ converge to
$\theta$ in probability. For $Y_n$ we used the cdf of $Y_n$, but for $2\overline{X}_n$ we appealed to the Weak
Law of Large Numbers. In fact, the cdf of $2\overline{X}_n$ is quite complicated for the uniform
model. In many situations, the cdf of the statistic cannot be obtained, but we can
appeal to asymptotic theory to establish the result. There are other estimators of
$\theta$. Which is the “best” estimator? In future chapters we will be concerned with such
questions.
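
The two routes to consistency can be compared numerically. For the maximum, the cdf of $Y_n$ gives $P[|Y_n - \theta| \geq \epsilon] = (1 - \epsilon/\theta)^n$ for $0 < \epsilon < \theta$, while for $2\overline{X}_n$ the probability is simply estimated by simulation. The values $\theta = 10$ and $\epsilon = 0.5$, the sample sizes, the seed, and the number of replications below are arbitrary illustrative choices.

```r
# Sketch: P[|estimator - theta| >= eps] for Y_n = max(X_i) and for 2 * Xbar_n
# under a uniform(0, theta) model. All settings are arbitrary illustrations.
set.seed(5112)
theta <- 10
eps <- 0.5
nsim <- 5000
ns <- c(5, 50, 500)
# From the cdf of Y_n: P[Y_n <= theta - eps] = (1 - eps / theta)^n.
exact_max <- (1 - eps / theta)^ns
res <- sapply(ns, function(n) {
  x <- matrix(runif(nsim * n, 0, theta), nrow = nsim)  # one sample per row
  yn <- apply(x, 1, max)
  twoxbar <- 2 * rowMeans(x)
  c(max = mean(abs(yn - theta) >= eps),
    twice_mean = mean(abs(twoxbar - theta) >= eps))
})
sim_max <- res["max", ]
sim_twice_mean <- res["twice_mean", ]
cbind(n = ns, exact_max, sim_max, sim_twice_mean)
```

Both sets of probabilities shrink as $n$ grows, which is what consistency requires; only the maximum has a simple closed form to compare against.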

Consistency is a very important property for an estimator to have. It is a poor
estimator that does not approach its target as the sample size gets large. Note that
the same cannot be said for the property of unbiasedness. For example, instead of
using the sample variance to estimate $\sigma^2$, suppose we use $V = n^{-1} \sum_{i=1}^{n} (X_i - \overline{X})^2$.
Then $V$ is consistent for $\sigma^2$, but it is biased, because $E(V) = (n - 1)\sigma^2/n$. Thus
the bias of $V$ is $-\sigma^2/n$, which vanishes as $n \to \infty$.
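
The bias of $V$ and its disappearance with growing $n$ can be seen in a short simulation. The sketch assumes a normal population with $\sigma^2 = 4$ purely for illustration; v_stat is a hypothetical helper name, and the seed and number of replications are arbitrary.

```r
# Sketch: V = n^{-1} sum (X_i - Xbar)^2 is biased for sigma^2 at fixed n,
# but the bias -sigma^2 / n shrinks as n grows. Settings are arbitrary.
set.seed(5113)
sigma2 <- 4
nsim <- 20000
v_stat <- function(x) mean((x - mean(x))^2)
ns <- c(5, 50, 500)
mean_v <- sapply(ns, function(n) {
  mean(replicate(nsim, v_stat(rnorm(n, mean = 0, sd = sqrt(sigma2)))))
})
theory_ev <- (ns - 1) * sigma2 / ns
cbind(n = ns, simulated_EV = mean_v, theory_EV = theory_ev,
      theory_bias = -sigma2 / ns)
```

The simulated averages of $V$ should track $(n - 1)\sigma^2/n$, sitting visibly below $\sigma^2$ for $n = 5$ and nearly at $\sigma^2$ for $n = 500$.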


EXERCISES 


5.1.1. Let $\{a_n\}$ be a sequence of real numbers. Hence, we can also say that $\{a_n\}$
is a sequence of constant (degenerate) random variables. Let $a$ be a real number.
Show that $a_n \to a$ is equivalent to $a_n \xrightarrow{P} a$.

5.1.2. Let the random variable $Y_n$ have a distribution that is $b(n, p)$.


(a) Prove that $Y_n/n$ converges in probability to $p$. This result is one form of the
weak law of large numbers.


(b) Prove that $1 - Y_n/n$ converges in probability to $1 - p$.
(c) Prove that $(Y_n/n)(1 - Y_n/n)$ converges in probability to $p(1 - p)$.


5.1.3. Let $W_n$ denote a random variable with mean $\mu$ and variance $b/n^p$, where
$p > 0$, $\mu$, and $b$ are constants (not functions of $n$). Prove that $W_n$ converges in
probability to $\mu$.

Hint: Use Chebyshev’s inequality.


5.1.4. Derive the cdf given in expression (5.1.1).

5.1.5. Consider the R function consistmean which produces the plot shown in
Figure 5.1.1. Obtain a similar plot for the sample median when the distribution
sampled is the N(0,1) distribution. Compare the mean and median plots.


5.1.6. Write an R function that obtains a plot similar to Figure 5.1.1 for the situation
described in Example 5.1.2. For the plot choose $\theta = 10$.


5.1.7. Let $X_1, \ldots, X_n$ be iid random variables with common pdf

$$f(x) = \begin{cases} e^{-(x - \theta)} & x > \theta,\ -\infty < \theta < \infty \\ 0 & \text{elsewhere.} \end{cases} \tag{5.1.3}$$

This pdf is called the shifted exponential. Let $Y_n = \min\{X_1, \ldots, X_n\}$. Prove
that $Y_n \to \theta$ in probability by first obtaining the cdf of $Y_n$.


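5.1.8. Using the assumptions behind the confidence interval given in expression
(4.2.9), show that

$$\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}} \bigg/ \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \xrightarrow{P} 1.$$


5.1.9. For Exercise 5.1.7, obtain the mean of $Y_n$. Is $Y_n$ an unbiased estimator of
$\theta$? Obtain an unbiased estimator of $\theta$ based on $Y_n$.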


5.2 Convergence in Distribution 


In the last section, we introduced the concept of convergence in probability. With 
this concept, we can formally say, for instance, that a statistic converges to a pa- 
rameter and, furthermore, in many situations we can show this without having to 
obtain the distribution function of the statistic. But how close is the statistic to the
parameter? For instance, can we obtain the error of estimation with some credence?
The method of convergence discussed in this section, in conjunction with earlier 
results, gives us affirmative answers to these questions. 


Definition 5.2.1 (Convergence in Distribution). Let $\{X_n\}$ be a sequence of random
variables and let $X$ be a random variable. Let $F_{X_n}$ and $F_X$ be, respectively, the cdfs
of $X_n$ and $X$. Let $C(F_X)$ denote the set of all points where $F_X$ is continuous. We
say that $X_n$ converges in distribution to $X$ if

$$\lim_{n \to \infty} F_{X_n}(x) = F_X(x), \quad \text{for all } x \in C(F_X).$$

We denote this convergence by

$$X_n \xrightarrow{D} X.$$


Remark 5.2.1. This material on convergence in probability and in distribution
comes under what statisticians and probabilists refer to as asymptotic theory.
Often, we say that the distribution of $X$ is the asymptotic distribution or the
limiting distribution of the sequence $\{X_n\}$. We might even refer informally to
the asymptotics of certain situations. Moreover, for illustration, instead of saying
$X_n \xrightarrow{D} X$, where $X$ has a standard normal distribution, we may write

$$X_n \xrightarrow{D} N(0,1)$$

as an abbreviated way of saying the same thing. Clearly, the right-hand member
of this last expression is a distribution and not a random variable as it should be,
but we will make use of this convention. In addition, we may say that $X_n$ has
a limiting standard normal distribution to mean that $X_n \xrightarrow{D} X$, where $X$ has a
standard normal distribution, or equivalently $X_n \xrightarrow{D} N(0,1)$. ■

Motivation for considering only points of continuity of $F_X$ is given by the
following simple example. Let $X_n$ be a random variable with all its mass at $1/n$ and
let $X$ be a random variable with all its mass at 0. Then, as Figure 5.2.1 shows,
all the mass of $X_n$ is converging to 0, i.e., the distribution of $X$. At the point of
discontinuity of $F_X$, $\lim F_{X_n}(0) = 0 \neq 1 = F_X(0)$, while at continuity points $x$ of
$F_X$ (i.e., $x \neq 0$), $\lim F_{X_n}(x) = F_X(x)$. Hence, according to the definition, $X_n \xrightarrow{D} X$.


Figure 5.2.1: Cdf of $X_n$ that has all its mass at $n^{-1}$. (Horizontal axis $x$, vertical axis $F_n(x)$; the cdf jumps from 0 to 1 at $x = n^{-1}$.)


Convergence in probability is a way of saying that a sequence of random variables
$X_n$ is getting close to another random variable $X$. On the other hand, convergence
in distribution is only concerned with the cdfs $F_{X_n}$ and $F_X$. A simple example
illustrates this. Let $X$ be a continuous random variable with a pdf $f_X(x)$ that
is symmetric about 0; i.e., $f_X(-x) = f_X(x)$. Then it is easy to show that the
density of the random variable $-X$ is also $f_X(x)$. Thus, $X$ and $-X$ have the same
distribution. Define the sequence of random variables $X_n$ as

$$X_n = \begin{cases} X & \text{if } n \text{ is odd} \\ -X & \text{if } n \text{ is even.} \end{cases} \tag{5.2.1}$$

Clearly, $F_{X_n}(x) = F_X(x)$ for all $x$ in the support of $X$, so that $X_n \xrightarrow{D} X$. On the
other hand, the sequence $X_n$ does not get close to $X$. In particular, $X_n \nrightarrow X$ in
probability.
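The contrast can be seen in a small simulation. This is only a sketch: the choice of $X$ as $N(0,1)$, the seed, the evaluation points, and the tolerance 0.5 are assumptions made for illustration and are not part of the text.

```r
set.seed(5)                 # arbitrary seed
x <- rnorm(10000)           # assumed choice: X ~ N(0,1), symmetric about 0
x_even <- -x                # X_n for even n in (5.2.1)

# Same distribution: the empirical cdfs of X and -X nearly coincide
qs <- c(-1, 0, 1)
cdfs <- rbind(X = ecdf(x)(qs), minusX = ecdf(x_even)(qs))
cdfs

# But |X_n - X| = 2|X| for even n, so X_n is often far from X
far <- mean(abs(x_even - x) >= 0.5)
far
```

For even $n$ the proportion `far` estimates $P(2|X| \geq 0.5)$, which does not depend on $n$ and so cannot tend to 0.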


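Example 5.2.1. Let $\bar{X}_n$ have the cdf

$$F_n(\bar{x}) = \int_{-\infty}^{\bar{x}} \frac{1}{\sqrt{1/n}\,\sqrt{2\pi}}\, e^{-nw^2/2}\, dw.$$

If the change of variable $v = \sqrt{n}\,w$ is made, we have

$$F_n(\bar{x}) = \int_{-\infty}^{\sqrt{n}\,\bar{x}} \frac{1}{\sqrt{2\pi}}\, e^{-v^2/2}\, dv.$$

It is clear that

$$\lim_{n \to \infty} F_n(\bar{x}) = \begin{cases} 0 & \bar{x} < 0 \\ \frac{1}{2} & \bar{x} = 0 \\ 1 & \bar{x} > 0. \end{cases}$$

Now the function

$$F(\bar{x}) = \begin{cases} 0 & \bar{x} < 0 \\ 1 & \bar{x} \geq 0 \end{cases}$$

is a cdf and $\lim_{n \to \infty} F_n(\bar{x}) = F(\bar{x})$ at every point of continuity of $F(\bar{x})$. To be
sure, $\lim_{n \to \infty} F_n(0) \neq F(0)$, but $F(\bar{x})$ is not continuous at $\bar{x} = 0$. Accordingly, the
sequence $\bar{X}_1, \bar{X}_2, \bar{X}_3, \ldots$ converges in distribution to a random variable that has a
degenerate distribution at $\bar{x} = 0$. ■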
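The limit in Example 5.2.1 can be checked numerically. The sketch below reads the cdf in the example as that of a $N(0, 1/n)$ distribution, so that $F_n$ is `pnorm` with standard deviation $1/\sqrt{n}$; the grid of points and the values of $n$ are arbitrary choices made for illustration.

```r
# F_n at a few points for increasing n (Xbar_n taken to be N(0, 1/n))
x  <- c(-0.5, -0.1, 0, 0.1, 0.5)
ns <- c(1, 10, 100, 10000)
Fn <- sapply(ns, function(n) pnorm(x, mean = 0, sd = 1 / sqrt(n)))
dimnames(Fn) <- list(paste0("x=", x), paste0("n=", ns))
round(Fn, 4)
```

Each row with $x < 0$ should drift toward 0 and each row with $x > 0$ toward 1, while the row for $x = 0$ stays at 1/2 for every $n$.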


Example 5.2.2. Even if a sequence $X_1, X_2, X_3, \ldots$ converges in distribution to a
random variable $X$, we cannot in general determine the distribution of $X$ by taking
the limit of the pmf of $X_n$. This is illustrated by letting $X_n$ have the pmf

$$p_n(x) = \begin{cases} 1 & x = 2 + n^{-1} \\ 0 & \text{elsewhere.} \end{cases}$$

Clearly, $\lim_{n \to \infty} p_n(x) = 0$ for all values of $x$. This may suggest that $X_n$, for
$n = 1, 2, 3, \ldots$, does not converge in distribution. However, the cdf of $X_n$ is

$$F_n(x) = \begin{cases} 0 & x < 2 + n^{-1} \\ 1 & x \geq 2 + n^{-1}, \end{cases}$$

and

$$\lim_{n \to \infty} F_n(x) = \begin{cases} 0 & x \leq 2 \\ 1 & x > 2. \end{cases}$$

Since

$$F(x) = \begin{cases} 0 & x < 2 \\ 1 & x \geq 2 \end{cases}$$

is a cdf, and since $\lim_{n \to \infty} F_n(x) = F(x)$ at all points of continuity of $F(x)$, the
sequence $X_1, X_2, X_3, \ldots$ converges in distribution to a random variable with cdf
$F(x)$. ■
The last example shows in general that we cannot determine limiting distribu-
tions by considering pmfs or pdfs. But under certain conditions we can determine 
convergence in distribution by considering the sequence of pdfs as the following 
example shows. 


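Example 5.2.3. Let $T_n$ have a $t$-distribution with $n$ degrees of freedom, $n = 1, 2, 3, \ldots$.
Thus its cdf is

$$F_n(t) = \int_{-\infty}^{t} \frac{\Gamma[(n+1)/2]}{\sqrt{\pi n}\,\Gamma(n/2)} \frac{1}{(1 + y^2/n)^{(n+1)/2}}\, dy,$$

where the integrand is the pdf $f_n(y)$ of $T_n$. Accordingly,

$$\lim_{n \to \infty} F_n(t) = \lim_{n \to \infty} \int_{-\infty}^{t} f_n(y)\, dy = \int_{-\infty}^{t} \lim_{n \to \infty} f_n(y)\, dy,$$

by a result in analysis (the Lebesgue Dominated Convergence Theorem) that allows
us to interchange the order of the limit and integration, provided that $|f_n(y)|$ is
dominated by a function that is integrable. This is true because

$$|f_n(y)| \leq 10 f_1(y)$$

and

$$\int_{-\infty}^{t} 10 f_1(y)\, dy = \frac{10}{\pi}\left(\arctan t + \frac{\pi}{2}\right) < \infty,$$

for all real $t$. Hence we can find the limiting distribution by finding the limit of the
pdf of $T_n$. It is

$$\lim_{n \to \infty} f_n(y) = \lim_{n \to \infty} \left\{ \frac{\Gamma[(n+1)/2]}{\sqrt{n/2}\,\Gamma(n/2)} \right\} \lim_{n \to \infty} \left\{ \frac{1}{(1 + y^2/n)^{1/2}} \right\} \lim_{n \to \infty} \left\{ \frac{1}{\sqrt{2\pi}} \left[ \left( 1 + \frac{y^2}{n} \right) \right]^{-n/2} \right\}.$$

Using the fact from elementary calculus that

$$\lim_{n \to \infty} \left( 1 + \frac{y^2}{n} \right)^n = e^{y^2},$$

the limit associated with the third factor is clearly the pdf of the standard normal
distribution. The second limit obviously equals 1. By Remark 5.2.2, the first limit
also equals 1. Thus, we have

$$\lim_{n \to \infty} F_n(t) = \int_{-\infty}^{t} \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy,$$

and hence $T_n$ has a limiting standard normal distribution. ■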
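A quick numerical look at this limit uses the base R functions `pt` and `pnorm`. The grid and the degrees of freedom below are arbitrary choices for illustration; the sketch only reports the largest gap between the two cdfs on that grid.

```r
# Largest gap between the t(n) cdf and the N(0,1) cdf on a grid of t values
tgrid   <- seq(-4, 4, by = 0.01)
dfs     <- c(1, 5, 30, 100, 1000)
maxdiff <- sapply(dfs, function(n) max(abs(pt(tgrid, df = n) - pnorm(tgrid))))
cbind(df = dfs, maxdiff = signif(maxdiff, 3))
```

The gap should shrink steadily as the degrees of freedom increase, in line with the limiting standard normal distribution.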
Remark 5.2.2 (Stirling’s Formula). In advanced calculus the following
approximation is derived:

$$\Gamma(k+1) \approx \sqrt{2\pi}\, k^{k+1/2} e^{-k}. \tag{5.2.2}$$

This is known as Stirling’s formula and it is an excellent approximation when $k$ is
large. Because $\Gamma(k+1) = k!$, for $k$ an integer, this formula gives an idea of how fast
$k!$ grows. As Exercise 5.2.21 shows, this approximation can be used to show that
the first limit in Example 5.2.3 is 1. ■


Example 5.2.4 (Maximum of a Sample from a Uniform Distribution, Continued).
Recall Example 5.1.2, where $X_1, \ldots, X_n$ is a random sample from a uniform$(0, \theta)$
distribution. Again, let $Y_n = \max\{X_1, \ldots, X_n\}$, but now consider the random
variable $Z_n = n(\theta - Y_n)$. Let $t \in (0, n\theta)$. Then, using the cdf of $Y_n$, (5.1.1), the cdf
of $Z_n$ is

$$P[Z_n \leq t] = P[Y_n \geq \theta - (t/n)] = 1 - \left( \frac{\theta - (t/n)}{\theta} \right)^n = 1 - \left( 1 - \frac{t/\theta}{n} \right)^n \to 1 - e^{-t/\theta}.$$

Note that the last quantity is the cdf of an exponential random variable with mean
$\theta$, (3.3.6), i.e., $\Gamma(1, \theta)$. So we say that $Z_n \xrightarrow{D} Z$, where $Z$ is distributed $\Gamma(1, \theta)$. ■
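The limiting exponential distribution of $Z_n$ can be examined by simulation. In this sketch the values $\theta = 2$, $n = 200$, the number of simulations, the seed, and the evaluation points are all assumptions chosen for illustration.

```r
set.seed(1)                              # arbitrary seed
theta <- 2; n <- 200; nsims <- 10000     # illustrative values only
# Each column is one sample of size n; y holds the sample maxima Y_n
y  <- apply(matrix(runif(n * nsims, 0, theta), nrow = n), 2, max)
z  <- n * (theta - y)                    # Z_n = n(theta - Y_n)
tt <- c(0.5, 1, 2, 4)
emp <- sapply(tt, function(t) mean(z <= t))       # empirical cdf of Z_n
cbind(t = tt, empirical = emp, limit = pexp(tt, rate = 1 / theta))
```

The empirical column should be close to the $\Gamma(1, \theta)$ cdf $1 - e^{-t/\theta}$, up to simulation error.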


Remark 5.2.3. To simplify several of the proofs of this section, we make use of
the $\underline{\lim}$ and $\overline{\lim}$ of a sequence. For readers who are unfamiliar with these concepts,
we discuss them in Appendix A. In this brief remark, we highlight the properties
needed for understanding the proofs. Let $\{a_n\}$ be a sequence of real numbers and
define the two sequences

$$b_n = \sup\{a_n, a_{n+1}, \ldots\}, \quad n = 1, 2, 3, \ldots, \tag{5.2.3}$$
$$c_n = \inf\{a_n, a_{n+1}, \ldots\}, \quad n = 1, 2, 3, \ldots. \tag{5.2.4}$$

The sequences $\{b_n\}$ and $\{c_n\}$ are nonincreasing and nondecreasing, respectively.
Hence their limits always exist (may be $\pm\infty$) and are denoted respectively by
$\overline{\lim}_{n \to \infty} a_n$ and $\underline{\lim}_{n \to \infty} a_n$. Further, $c_n \leq a_n \leq b_n$, for all $n$. Hence, by the
Sandwich Theorem (see Theorem A.2.1 of Appendix A), if $\underline{\lim}_{n \to \infty} a_n = \overline{\lim}_{n \to \infty} a_n$,
then $\lim_{n \to \infty} a_n$ exists and is given by $\lim_{n \to \infty} a_n = \overline{\lim}_{n \to \infty} a_n$.

As discussed in Appendix A, several other properties of these concepts are useful.
For example, suppose $\{p_n\}$ is a sequence of probabilities and $\overline{\lim}_{n \to \infty} p_n = 0$. Then,
by the Sandwich Theorem, since $0 \leq p_n \leq \sup\{p_n, p_{n+1}, \ldots\}$ for all $n$, we have
$\lim_{n \to \infty} p_n = 0$. Also, for any two sequences $\{a_n\}$ and $\{b_n\}$, it easily follows that
$\overline{\lim}_{n \to \infty}(a_n + b_n) \leq \overline{\lim}_{n \to \infty} a_n + \overline{\lim}_{n \to \infty} b_n$. ■


As the following theorem shows, convergence in distribution is weaker than con- 
vergence in probability. Thus convergence in distribution is often called weak con- 
vergence. 
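Theorem 5.2.1. If $X_n$ converges to $X$ in probability, then $X_n$ converges to $X$ in
distribution.

Proof: Let $x$ be a point of continuity of $F_X(x)$. For every $\epsilon > 0$,

$$\begin{aligned} F_{X_n}(x) &= P[X_n \leq x] \\ &= P[\{X_n \leq x\} \cap \{|X_n - X| < \epsilon\}] + P[\{X_n \leq x\} \cap \{|X_n - X| \geq \epsilon\}] \\ &\leq P[X \leq x + \epsilon] + P[|X_n - X| \geq \epsilon]. \end{aligned}$$

Based on this inequality and the fact that $X_n \xrightarrow{P} X$, we see that

$$\overline{\lim}_{n \to \infty} F_{X_n}(x) \leq F_X(x + \epsilon). \tag{5.2.5}$$

To get a lower bound, we proceed similarly with the complement to show that

$$P[X_n > x] \leq P[X \geq x - \epsilon] + P[|X_n - X| \geq \epsilon].$$

Hence

$$\underline{\lim}_{n \to \infty} F_{X_n}(x) \geq F_X(x - \epsilon). \tag{5.2.6}$$

Using a relationship between $\overline{\lim}$ and $\underline{\lim}$, it follows from (5.2.5) and (5.2.6) that

$$F_X(x - \epsilon) \leq \underline{\lim}_{n \to \infty} F_{X_n}(x) \leq \overline{\lim}_{n \to \infty} F_{X_n}(x) \leq F_X(x + \epsilon).$$

Letting $\epsilon \downarrow 0$ gives us the desired result. ■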


Reconsider the sequence of random variables $\{X_n\}$ defined by expression (5.2.1).
Here, $X_n \xrightarrow{D} X$ but $X_n \nrightarrow X$ in probability. So, in general, the converse of the above
theorem is not true. However, it is true if $X$ is degenerate, as shown by the following theorem.


Theorem 5.2.2. If $X_n$ converges to the constant $b$ in distribution, then $X_n$
converges to $b$ in probability.

Proof: Let $\epsilon > 0$ be given. Then

$$\lim_{n \to \infty} P[|X_n - b| \leq \epsilon] = \lim_{n \to \infty} F_{X_n}(b + \epsilon) - \lim_{n \to \infty} F_{X_n}[(b - \epsilon) - 0] = 1 - 0 = 1,$$

which is the desired result. ■


A result that will prove quite useful is the following: 


Theorem 5.2.3. Suppose $X_n$ converges to $X$ in distribution and $Y_n$ converges in
probability to 0. Then $X_n + Y_n$ converges to $X$ in distribution.

The proof is similar to that of Theorem 5.2.2 and is left to Exercise 5.2.13. We
often use this last result as follows. Suppose it is difficult to show that $X_n$ converges
to $X$ in distribution, but it is easy to show that $Y_n$ converges in distribution to
$X$ and that $X_n - Y_n$ converges to 0 in probability. Hence, by this last theorem,
$X_n = Y_n + (X_n - Y_n) \xrightarrow{D} X$, as desired.

The next two theorems state general results. A proof of the first result can 
be found in a more advanced text, while the second, Slutsky’s Theorem, follows 
similarly to that of Theorem 5.2.1. 
Theorem 5.2.4. Suppose $X_n$ converges to $X$ in distribution and $g$ is a continuous
function on the support of $X$. Then $g(X_n)$ converges to $g(X)$ in distribution.

An often-used application of this theorem occurs when we have a sequence of
random variables $Z_n$ which converges in distribution to a standard normal random
variable $Z$. Because the distribution of $Z^2$ is $\chi^2(1)$, it follows by Theorem 5.2.4
that $Z_n^2$ converges in distribution to a $\chi^2(1)$ distribution.


Theorem 5.2.5 (Slutsky’s Theorem). Let $X_n$, $X$, $A_n$, and $B_n$ be random variables
and let $a$ and $b$ be constants. If $X_n \xrightarrow{D} X$, $A_n \xrightarrow{P} a$, and $B_n \xrightarrow{P} b$, then

$$A_n + B_n X_n \xrightarrow{D} a + bX.$$


5.2.1 Bounded in Probability


Another useful concept, related to convergence in distribution, is boundedness in 
probability of a sequence of random variables. 

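First consider any random variable $X$ with cdf $F_X(x)$. Then given $\epsilon > 0$, we
can bound $X$ in the following way. Because the lower limit of $F_X$ is 0 and its upper
limit is 1, we can find $\eta_1$ and $\eta_2$ such that

$$F_X(x) < \epsilon/2 \text{ for } x \leq \eta_1 \quad \text{and} \quad F_X(x) > 1 - (\epsilon/2) \text{ for } x \geq \eta_2.$$

Let $\eta = \max\{|\eta_1|, |\eta_2|\}$. Then

$$P[|X| \leq \eta] = F_X(\eta) - F_X(-\eta - 0) \geq 1 - (\epsilon/2) - (\epsilon/2) = 1 - \epsilon. \tag{5.2.7}$$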


Thus random variables which are not bounded [e.g., X is N(0,1)] are still bounded 
in this probability way. This is a useful concept for sequences of random variables, 
which we define next. 


Definition 5.2.2 (Bounded in Probability). We say that the sequence of random
variables $\{X_n\}$ is bounded in probability if, for all $\epsilon > 0$, there exist a constant
$B_\epsilon > 0$ and an integer $N_\epsilon$ such that

$$n \geq N_\epsilon \Rightarrow P[|X_n| \leq B_\epsilon] \geq 1 - \epsilon.$$


Next, consider a sequence of random variables $\{X_n\}$ which converges in
distribution to a random variable $X$ that has cdf $F$. Let $\epsilon > 0$ be given and choose $\eta$ so
that (5.2.7) holds for $X$. We can always choose $\eta$ so that $\eta$ and $-\eta$ are continuity
points of $F$. We then have

$$\lim_{n \to \infty} P[|X_n| \leq \eta] \geq \lim_{n \to \infty} F_{X_n}(\eta) - \lim_{n \to \infty} F_{X_n}(-\eta - 0) = F_X(\eta) - F_X(-\eta) \geq 1 - \epsilon.$$

To be precise, we can then choose $N$ so large that $P[|X_n| \leq \eta] \geq 1 - \epsilon$, for $n \geq N$.
We have thus proved the following theorem.

Theorem 5.2.6. Let $\{X_n\}$ be a sequence of random variables and let $X$ be a random
variable. If $X_n \to X$ in distribution, then $\{X_n\}$ is bounded in probability.
As the following example shows, the converse of this theorem is not true.


Example 5.2.5. Take $\{X_n\}$ to be the following sequence of degenerate random
variables. For $n = 2m$ even, $X_{2m} = 2 + (1/(2m))$ with probability 1. For $n = 2m - 1$
odd, $X_{2m-1} = 1 + (1/(2m))$ with probability 1. Then the sequence $\{X_2, X_4, X_6, \ldots\}$
converges in distribution to the degenerate random variable $Y = 2$, while the
sequence $\{X_1, X_3, X_5, \ldots\}$ converges in distribution to the degenerate random variable
$W = 1$. Since the distributions of $Y$ and $W$ are not the same, the sequence $\{X_n\}$
does not converge in distribution. Because all of the mass of the sequence $\{X_n\}$ is
in the interval $[1, 5/2]$, however, the sequence $\{X_n\}$ is bounded in probability. ■


One way of thinking of a sequence that is bounded in probability (or one that is
converging to a random variable in distribution) is that the probability mass of $|X_n|$
is not escaping to $\infty$. At times we can use boundedness in probability instead of
convergence in distribution. A property we will need later is given in the following
theorem:


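Theorem 5.2.7. Let $\{X_n\}$ be a sequence of random variables bounded in probability
and let $\{Y_n\}$ be a sequence of random variables that converges to 0 in probability.
Then

$$X_n Y_n \xrightarrow{P} 0.$$

Proof: Let $\epsilon > 0$ be given. Choose $B_\epsilon > 0$ and an integer $N_\epsilon$ such that

$$n \geq N_\epsilon \Rightarrow P[|X_n| \leq B_\epsilon] \geq 1 - \epsilon.$$

Then

$$\begin{aligned} \overline{\lim}_{n \to \infty} P[|X_n Y_n| \geq \epsilon] &\leq \overline{\lim}_{n \to \infty} P[|X_n Y_n| \geq \epsilon, |X_n| \leq B_\epsilon] \\ &\quad + \overline{\lim}_{n \to \infty} P[|X_n Y_n| \geq \epsilon, |X_n| > B_\epsilon] \\ &\leq \overline{\lim}_{n \to \infty} P[|Y_n| \geq \epsilon/B_\epsilon] + \epsilon = \epsilon, \end{aligned} \tag{5.2.8}$$

from which the desired result follows. ■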


5.2.2 Δ-Method


Recall a common problem discussed in the last three chapters is the situation where
we know the distribution of a random variable, but we want to determine the
distribution of a function of it. This is also true in asymptotic theory, and Theorems
5.2.4 and 5.2.5 are illustrations of this. Another such result is called the Δ-method.
To establish this result, we need a convenient form of the mean value theorem
with remainder, sometimes called Young’s Theorem; see Hardy (1992) or Lehmann
(1999). Suppose $g(x)$ is differentiable at $x$. Then we can write

$$g(y) = g(x) + g'(x)(y - x) + o(|y - x|), \tag{5.2.9}$$

where the notation $o$ means

$$a = o(b) \text{ if and only if } \frac{a}{b} \to 0, \text{ as } b \to 0.$$
The little-o notation is used in terms of convergence in probability, also. We
often write $o_p(X_n)$, which means

$$Y_n = o_p(X_n) \text{ if and only if } \frac{Y_n}{X_n} \xrightarrow{P} 0, \text{ as } n \to \infty. \tag{5.2.10}$$

There is a corresponding big-$O_p$ notation, which is given by

$$Y_n = O_p(X_n) \text{ if and only if } \frac{Y_n}{X_n} \text{ is bounded in probability as } n \to \infty. \tag{5.2.11}$$

The following theorem illustrates the little-o notation, but it also serves as a
lemma for Theorem 5.2.9.


Theorem 5.2.8. Suppose $\{Y_n\}$ is a sequence of random variables that is bounded
in probability. Suppose $X_n = o_p(Y_n)$. Then $X_n \xrightarrow{P} 0$, as $n \to \infty$.

Proof: Let $\epsilon > 0$ be given. Because the sequence $\{Y_n\}$ is bounded in probability,
there exist positive constants $N_\epsilon$ and $B_\epsilon$ such that

$$n \geq N_\epsilon \Rightarrow P[|Y_n| \leq B_\epsilon] \geq 1 - \epsilon. \tag{5.2.12}$$

Also, because $X_n = o_p(Y_n)$, we have

$$\frac{X_n}{Y_n} \xrightarrow{P} 0, \tag{5.2.13}$$

as $n \to \infty$. We then have

$$\begin{aligned} P[|X_n| \geq \epsilon] &= P[|X_n| \geq \epsilon, |Y_n| \leq B_\epsilon] + P[|X_n| \geq \epsilon, |Y_n| > B_\epsilon] \\ &\leq P\left[ \frac{|X_n|}{|Y_n|} \geq \frac{\epsilon}{B_\epsilon} \right] + P[|Y_n| > B_\epsilon]. \end{aligned}$$

By (5.2.13) and (5.2.12), respectively, the first and second terms on the right side
can be made arbitrarily small by choosing $n$ sufficiently large. Hence the result is
true. ■

We can now prove the theorem about the asymptotic procedure, which is often
called the Δ-method.


Theorem 5.2.9. Let $\{X_n\}$ be a sequence of random variables such that

$$\sqrt{n}(X_n - \theta) \xrightarrow{D} N(0, \sigma^2). \tag{5.2.14}$$

Suppose the function $g(x)$ is differentiable at $\theta$ and $g'(\theta) \neq 0$. Then

$$\sqrt{n}(g(X_n) - g(\theta)) \xrightarrow{D} N(0, \sigma^2 (g'(\theta))^2). \tag{5.2.15}$$

Proof: Using expression (5.2.9), we have

$$g(X_n) = g(\theta) + g'(\theta)(X_n - \theta) + o_p(|X_n - \theta|),$$

where $o_p$ is interpreted as in (5.2.10). Rearranging, we have

$$\sqrt{n}(g(X_n) - g(\theta)) = g'(\theta)\sqrt{n}(X_n - \theta) + o_p(\sqrt{n}|X_n - \theta|).$$

Because (5.2.14) holds, Theorem 5.2.6 implies that $\sqrt{n}|X_n - \theta|$ is bounded in
probability. Therefore, by Theorem 5.2.8, $o_p(\sqrt{n}|X_n - \theta|) \to 0$, in probability. Hence,
by (5.2.14) and Theorem 5.2.1, the result follows. ■

Illustrations of the Δ-method can be found in Example 5.2.8 and the exercises.


5.2.3 Moment Generating Function Technique 


To find the limiting distribution function of a random variable $X_n$ by using the
definition obviously requires that we know $F_{X_n}(x)$ for each positive integer $n$. But
it is often difficult to obtain $F_{X_n}(x)$ in closed form. Fortunately, if it exists, the
mgf that corresponds to the cdf $F_{X_n}(x)$ often provides a convenient method of
determining the limiting cdf.

The following theorem, which is essentially Curtiss’ (1942) modification of a
theorem of Lévy and Cramér, explains how the mgf may be used in problems of
limiting distributions. A proof of the theorem is beyond the scope of this book. It
can readily be found in more advanced books; see, for instance, page 171 of Breiman
(1968) for a proof based on characteristic functions.


Theorem 5.2.10. Let $\{X_n\}$ be a sequence of random variables with mgf $M_{X_n}(t)$
that exists for $-h < t < h$ for all $n$. Let $X$ be a random variable with mgf $M(t)$,
which exists for $|t| \leq h_1 \leq h$. If $\lim_{n \to \infty} M_{X_n}(t) = M(t)$ for $|t| \leq h_1$, then $X_n \xrightarrow{D} X$.


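In this and the subsequent sections are several illustrations of the use of Theorem
5.2.10. In some of these examples it is convenient to use a certain limit that is
established in some courses in advanced calculus. We refer to a limit of the form

$$\lim_{n \to \infty} \left[ 1 + \frac{b}{n} + \frac{\psi(n)}{n} \right]^{cn},$$

where $b$ and $c$ do not depend upon $n$ and where $\lim_{n \to \infty} \psi(n) = 0$. Then

$$\lim_{n \to \infty} \left[ 1 + \frac{b}{n} + \frac{\psi(n)}{n} \right]^{cn} = \lim_{n \to \infty} \left( 1 + \frac{b}{n} \right)^{cn} = e^{bc}. \tag{5.2.16}$$

For example,

$$\lim_{n \to \infty} \left( 1 - \frac{t^2}{n} + \frac{t^2}{n^{3/2}} \right)^{-n/2} = \lim_{n \to \infty} \left( 1 - \frac{t^2}{n} + \frac{t^2/\sqrt{n}}{n} \right)^{-n/2}.$$

Here $b = -t^2$, $c = -\frac{1}{2}$, and $\psi(n) = t^2/\sqrt{n}$. Accordingly, for every fixed value of $t$,
the limit is $e^{t^2/2}$.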
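The limit in this illustration can be checked numerically. The value $t = 1.5$ and the sequence of $n$ values below are arbitrary choices; the function name `f` is introduced only for this sketch.

```r
t  <- 1.5                                  # any fixed value of t
f  <- function(n) (1 - t^2 / n + t^2 / n^(3/2))^(-n / 2)
ns <- 10^(2:7)
cbind(n = ns, value = f(ns), limit = exp(t^2 / 2))
```

Because $\psi(n) = t^2/\sqrt{n}$ tends to 0 only at rate $1/\sqrt{n}$, the values approach $e^{t^2/2}$ rather slowly.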
Example 5.2.6. Let $Y_n$ have a distribution that is $b(n, p)$. Suppose that the mean
$\mu = np$ is the same for every $n$; that is, $p = \mu/n$, where $\mu$ is a constant. We shall
find the limiting distribution of the binomial distribution, when $p = \mu/n$, by finding
the limit of $M_{Y_n}(t)$. Now

$$M_{Y_n}(t) = E(e^{tY_n}) = [(1 - p) + pe^t]^n = \left[ 1 + \frac{\mu(e^t - 1)}{n} \right]^n$$

for all real values of $t$. Hence we have

$$\lim_{n \to \infty} M_{Y_n}(t) = e^{\mu(e^t - 1)}$$

for all real values of $t$. Since there exists a distribution, namely the Poisson
distribution with mean $\mu$, that has mgf $e^{\mu(e^t - 1)}$, then, in accordance with the theorem and
under the conditions stated, it is seen that $Y_n$ has a limiting Poisson distribution
with mean $\mu$.

Whenever a random variable has a limiting distribution, we may, if we wish, use
the limiting distribution as an approximation to the exact distribution function. The
result of this example enables us to use the Poisson distribution as an approximation
to the binomial distribution when $n$ is large and $p$ is small. To illustrate the use
of the approximation, let $Y$ have a binomial distribution with $n = 50$ and $p = \frac{1}{25}$.
Then, using R for the calculations, we have

$$P(Y \leq 1) = \left( \tfrac{24}{25} \right)^{50} + 50 \left( \tfrac{1}{25} \right) \left( \tfrac{24}{25} \right)^{49} = \texttt{pbinom(1,50,1/25)} = 0.4004812$$

approximately. Since $\mu = np = 2$, the Poisson approximation to this probability is
$e^{-2} + 2e^{-2} = \texttt{ppois(1,2)} = 0.4060058$. ■
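The two R calls quoted in the example can be placed side by side for several values of $n$ with $\mu = np$ held at 2. The particular values of $n$ are an assumption made for illustration; only $n = 50$ comes from the example.

```r
mu <- 2
ns <- c(10, 50, 200, 1000)                        # n = 50 is the case in the example
exact <- pbinom(1, size = ns, prob = mu / ns)     # P(Y <= 1) under b(n, mu/n)
cbind(n = ns, binomial = exact, poisson = ppois(1, mu))
```

The binomial column should move toward the fixed Poisson value as $n$ grows and $p = \mu/n$ shrinks.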


Example 5.2.7. Let $Z_n$ be $\chi^2(n)$. Then the mgf of $Z_n$ is $(1 - 2t)^{-n/2}$, $t < \frac{1}{2}$. The
mean and the variance of $Z_n$ are, respectively, $n$ and $2n$. The limiting distribution
of the random variable $Y_n = (Z_n - n)/\sqrt{2n}$ will be investigated. Now the mgf of
$Y_n$ is

$$\begin{aligned} M_{Y_n}(t) &= E\left\{ \exp\left[ t \left( \frac{Z_n - n}{\sqrt{2n}} \right) \right] \right\} \\ &= e^{-tn/\sqrt{2n}}\, E\left( e^{tZ_n/\sqrt{2n}} \right) \\ &= \exp\left[ -\left( t\sqrt{\frac{2}{n}} \right) \left( \frac{n}{2} \right) \right] \left( 1 - 2\frac{t}{\sqrt{2n}} \right)^{-n/2}, \quad t < \frac{\sqrt{2n}}{2}. \end{aligned}$$

This may be written in the form

$$M_{Y_n}(t) = \left( e^{t\sqrt{2/n}} - t\sqrt{\frac{2}{n}}\, e^{t\sqrt{2/n}} \right)^{-n/2}, \quad t < \sqrt{\frac{n}{2}}.$$

In accordance with Taylor’s formula, there exists a number $\xi(n)$, between 0 and
$t\sqrt{2/n}$, such that

$$e^{t\sqrt{2/n}} = 1 + t\sqrt{\frac{2}{n}} + \frac{1}{2} \left( t\sqrt{\frac{2}{n}} \right)^2 + \frac{e^{\xi(n)}}{6} \left( t\sqrt{\frac{2}{n}} \right)^3.$$

If this sum is substituted for $e^{t\sqrt{2/n}}$ in the last expression for $M_{Y_n}(t)$, it is seen that

$$M_{Y_n}(t) = \left( 1 - \frac{t^2}{n} + \frac{\psi(n)}{n} \right)^{-n/2},$$

where

$$\psi(n) = \frac{\sqrt{2}\, t^3 e^{\xi(n)}}{3\sqrt{n}} - \frac{\sqrt{2}\, t^3}{\sqrt{n}} - \frac{2 t^4 e^{\xi(n)}}{3n}.$$

Since $\xi(n) \to 0$ as $n \to \infty$, then $\lim \psi(n) = 0$ for every fixed value of $t$. In accordance
with the limit proposition cited earlier in this section, we have

$$\lim_{n \to \infty} M_{Y_n}(t) = e^{t^2/2}$$

for all real values of $t$. That is, the random variable $Y_n = (Z_n - n)/\sqrt{2n}$ has a
limiting standard normal distribution.

Figure 5.2.2 displays a verification of the asymptotic distribution of the
standardized $Z_n$. For each value of $n = 5, 10, 20$, and 50, 1000 observations from a
$\chi^2(n)$-distribution were generated, using the R command rchisq(1000,n). Each
observation $z_n$ was standardized as $y_n = (z_n - n)/\sqrt{2n}$ and a histogram of these
$y_n$s was computed. On this histogram, the pdf of a standard normal distribution is
superimposed. Note that at $n = 5$, the histogram of $y_n$ values is skewed, but as $n$
increases, the shape of the histogram nears the shape of the pdf, verifying the above
theory. These plots are computed by the R function cdistplt. In this function, it
is easy to change values of n for further such plots. ■
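The simulation described for Figure 5.2.2 can be sketched in a few lines of base R. The helper `stdchisq` below is hypothetical and is not the book's `cdistplt`, whose code is not reproduced here; the seed and the 2-by-2 panel layout are likewise assumptions.

```r
# Hypothetical helper (not the book's cdistplt): standardized chi-square draws
stdchisq <- function(n, nsims = 1000) (rchisq(nsims, df = n) - n) / sqrt(2 * n)

set.seed(2)                              # arbitrary seed
ns <- c(5, 10, 20, 50)
op <- par(mfrow = c(2, 2))
for (n in ns) {
  y <- stdchisq(n)
  hist(y, freq = FALSE, main = paste("n =", n), xlab = "y")
  curve(dnorm(x), add = TRUE)            # limiting N(0,1) pdf
}
par(op)
```

Changing `ns` gives further plots of the same kind.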


Example 5.2.8 (Example 5.2.7, Continued). In the notation of the last example,
we showed that

$$\sqrt{n} \left[ \frac{1}{\sqrt{2}\, n} Z_n - \frac{1}{\sqrt{2}} \right] \xrightarrow{D} N(0, 1). \tag{5.2.17}$$

For this situation, though, there are times when we are interested in the square
root of $Z_n$. Let $g(t) = \sqrt{t}$ and let $W_n = g(Z_n/(\sqrt{2}\, n)) = (Z_n/(\sqrt{2}\, n))^{1/2}$. Note that
$g(1/\sqrt{2}) = 1/2^{1/4}$ and $g'(1/\sqrt{2}) = 2^{-3/4}$. Therefore, by the Δ-method, Theorem
5.2.9, and (5.2.17), we have

$$\sqrt{n} \left[ W_n - 1/2^{1/4} \right] \xrightarrow{D} N(0, 2^{-3/2}). \tag{5.2.18}$$
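The conclusion (5.2.18) can be examined by simulation. In the sketch below the degrees of freedom $n = 2000$, the number of simulations, and the seed are assumptions chosen for illustration; the code compares the sample variance of $\sqrt{n}[W_n - 1/2^{1/4}]$ with the limiting variance $2^{-3/2}$.

```r
set.seed(3)                          # arbitrary seed
n <- 2000; nsims <- 5000             # illustrative values only
z <- rchisq(nsims, df = n)           # draws of Z_n
w <- sqrt(z / (sqrt(2) * n))         # W_n = g(Z_n / (sqrt(2) n)), g(t) = sqrt(t)
v <- sqrt(n) * (w - 2^(-1/4))        # left side of (5.2.18)
c(mean = mean(v), var = var(v), limiting_var = 2^(-3/2))
```

The sample mean should be near 0 and the sample variance near $2^{-3/2} \approx 0.354$, up to simulation error.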


EXERCISES 


5.2.1. Let $\bar{X}_n$ denote the mean of a random sample of size $n$ from a distribution
that is $N(\mu, \sigma^2)$. Find the limiting distribution of $\bar{X}_n$.

5.2.2. Let $Y_1$ denote the minimum of a random sample of size $n$ from a distribution
that has pdf $f(x) = e^{-(x-\theta)}$, $\theta < x < \infty$, zero elsewhere. Let $Z_n = n(Y_1 - \theta)$.
Investigate the limiting distribution of $Z_n$.

Figure 5.2.2: For each value of $n$, a histogram plot of 1000 generated values
$y_n$ is shown, where $y_n$ is discussed in Example 5.2.7. The limiting $N(0,1)$ pdf is
superimposed on the histogram.


5.2.3. Let $Y_n$ denote the maximum of a random sample of size $n$ from a distribution
of the continuous type that has cdf $F(x)$ and pdf $f(x) = F'(x)$. Find the limiting
distribution of $Z_n = n[1 - F(Y_n)]$.

5.2.4. Let $Y_2$ denote the second smallest item of a random sample of size $n$ from a
distribution of the continuous type that has cdf $F(x)$ and pdf $f(x) = F'(x)$. Find
the limiting distribution of $W_n = nF(Y_2)$.

5.2.5. Let the pmf of $Y_n$ be $p_n(y) = 1$, $y = n$, zero elsewhere. Show that $Y_n$
does not have a limiting distribution. (In this case, the probability has “escaped”
to infinity.)

5.2.6. Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from a distribution that is
$N(\mu, \sigma^2)$, where $\sigma^2 > 0$. Show that the sum $Z_n = \sum_{i=1}^{n} X_i$ does not have a limiting
distribution.

5.2.7. Let $X_n$ have a gamma distribution with parameter $\alpha = n$ and $\beta$, where $\beta$ is
not a function of $n$. Let $Y_n = X_n/n$. Find the limiting distribution of $Y_n$.

5.2.8. Let $Z_n$ be $\chi^2(n)$ and let $W_n = Z_n/n^2$. Find the limiting distribution of $W_n$.


5.2.9. Let $X$ be $\chi^2(50)$. Using the limiting distribution discussed in Example 5.2.7,
approximate $P(40 < X < 60)$. Compare your answer with that calculated by R.

5.2.10. Modify the R function cdistplt to show histograms of the values $w_n$
discussed in Example 5.2.8.


5.2.11. Let p = 0.95 be the probability that a man, in a certain age group, lives at 
least 5 years. 


(a) If we are to observe 60 such men and if we assume independence, use R to 
compute the probability that at least 56 of them live 5 or more years. 


(b) Find an approximation to the result of part (a) by using the Poisson distri- 
bution. 


Hint: Redefine p to be 0.05 and 1 — p= 0.95. 


5.2.12. Let the random variable $Z_n$ have a Poisson distribution with parameter
$\mu = n$. Show that the limiting distribution of the random variable $Y_n = (Z_n - n)/\sqrt{n}$
is normal with mean zero and variance 1.


5.2.13. Prove Theorem 5.2.3. 


5.2.14. Let $X_n$ and $Y_n$ have a bivariate normal distribution with parameters $\mu_1$, $\mu_2$,
$\sigma_1^2$, $\sigma_2^2$ (free of $n$) but $\rho = 1 - 1/n$. Consider the conditional distribution of $Y_n$, given
$X_n = x$. Investigate the limit of this conditional distribution as $n \to \infty$. What is
the limiting distribution if $\rho = -1 + 1/n$? Reference to these facts is made in the
remark of Section 2.5.

5.2.15. Let $\bar{X}_n$ denote the mean of a random sample of size $n$ from a Poisson
distribution with parameter $\mu = 1$.

(a) Show that the mgf of $Y_n = \sqrt{n}(\bar{X}_n - \mu)/\sigma = \sqrt{n}(\bar{X}_n - 1)$ is given by
$\exp[-t\sqrt{n} + n(e^{t/\sqrt{n}} - 1)]$.

(b) Investigate the limiting distribution of $Y_n$ as $n \to \infty$.
Hint: Replace, by its MacLaurin’s series, the expression $e^{t/\sqrt{n}}$, which is in the
exponent of the mgf of $Y_n$.

5.2.16. Using Exercise 5.2.15 and the Δ-method, find the limiting distribution of
$\sqrt{n}(\sqrt{\bar{X}_n} - 1)$.


5.2.17. Let $\bar{X}_n$ denote the mean of a random sample of size $n$ from a distribution
that has pdf $f(x) = e^{-x}$, $0 < x < \infty$, zero elsewhere.

(a) Show that the mgf $M_{Y_n}(t)$ of $Y_n = \sqrt{n}(\bar{X}_n - 1)$ is

$$M_{Y_n}(t) = \left[ e^{t/\sqrt{n}} - (t/\sqrt{n})\, e^{t/\sqrt{n}} \right]^{-n}, \quad t < \sqrt{n}.$$


(b) Find the limiting distribution of Y,, as n— oo. 


Exercises 5.2.15 and 5.2.17 are special instances of an important theorem that will 
be proved in the next section. 


5.2.18. Continuing with Exercise 5.2.17, use the Δ-method to find the limiting
distribution of $\sqrt{n}(\sqrt{\bar{X}_n} - 1)$.

5.2.19. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample (see
Section 5.2) from a distribution with pdf $f(x) = e^{-x}$, $0 < x < \infty$, zero elsewhere.
Determine the limiting distribution of $Z_n = (Y_n - \log n)$.

5.2.20. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample (see
Section 5.2) from a distribution with pdf $f(x) = 5x^4$, $0 < x < 1$, zero elsewhere.
Find $p$ so that $Z_n = n^p Y_1$ converges in distribution.


5.2.21. Consider Stirling’s formula (5.2.2): 


(a) Run the following R code to check this formula for k = 5 to k = 15.
   ks = 5; kstp = 15; coll = c(); for(j in ks:kstp){
   c1 = gamma(j+1); c2 = sqrt(2*pi)*exp(-j+(j+.5)*log(j))
   coll = rbind(coll,c(j,c1,c2))}; coll


(b) Take the log of Stirling’s formula and compare it with the R computation 
lgamma(k+1). 


(c) Use Stirling’s formula to show that the first limit in Example 5.2.3 is 1. 


5.3 Central Limit Theorem 


It was seen in Section 3.4 that if $X_1, X_2, \ldots, X_n$ is a random sample from a normal
distribution with mean $\mu$ and variance $\sigma^2$, the random variable

$$\frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma\sqrt{n}} = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma}$$

is, for every positive integer $n$, normally distributed with zero mean and unit
variance. In probability theory there is a very elegant theorem called the Central
Limit Theorem (CLT). A special case of this theorem asserts the remarkable and
important fact that if $X_1, X_2, \ldots, X_n$ denote the observations of a random sample
of size $n$ from any distribution having finite variance $\sigma^2 > 0$ (and hence finite mean
$\mu$), then the random variable $\sqrt{n}(\bar{X}_n - \mu)/\sigma$ converges in distribution to a random
variable having a standard normal distribution. Thus, whenever the conditions of
the theorem are satisfied, for large $n$ the random variable $\sqrt{n}(\bar{X}_n - \mu)/\sigma$ has an
approximate normal distribution with mean zero and variance 1. It is then possible
to use this approximate normal distribution to compute approximate probabilities
concerning $\bar{X}$.

We often use the notation “$Y_n$ has a limiting standard normal distribution” to
mean that $Y_n$ converges in distribution to a standard normal random variable; see
Remark 5.2.1.

The more general form of the theorem is stated, but it is proved only in the
modified case. However, this is exactly the proof of the theorem that would be
given if we could use the characteristic function in place of the mgf.
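Theorem 5.3.1 (Central Limit Theorem). Let $X_1, X_2, \ldots, X_n$ denote the
observations of a random sample from a distribution that has mean $\mu$ and positive variance
$\sigma^2$. Then the random variable $Y_n = (\sum_{i=1}^{n} X_i - n\mu)/\sqrt{n}\,\sigma = \sqrt{n}(\bar{X}_n - \mu)/\sigma$
converges in distribution to a random variable that has a normal distribution with mean
zero and variance 1.

Proof: For this proof, additionally assume that the mgf $M(t) = E(e^{tX})$ exists for
$-h < t < h$. If one replaces the mgf by the characteristic function $\varphi(t) = E(e^{itX})$,
which always exists, then our proof is essentially the same as the proof in a more
advanced course which uses characteristic functions.

The function

$$m(t) = E[e^{t(X - \mu)}] = e^{-\mu t} M(t)$$

also exists for $-h < t < h$. Since $m(t)$ is the mgf for $X - \mu$, it must follow that
$m(0) = 1$, $m'(0) = E(X - \mu) = 0$, and $m''(0) = E[(X - \mu)^2] = \sigma^2$. By Taylor’s
formula there exists a number $\xi$ between 0 and $t$ such that

$$m(t) = m(0) + m'(0)t + \frac{m''(\xi)t^2}{2} = 1 + \frac{m''(\xi)t^2}{2}.$$

If $\sigma^2 t^2/2$ is added and subtracted, then

$$m(t) = 1 + \frac{\sigma^2 t^2}{2} + \frac{[m''(\xi) - \sigma^2]t^2}{2}. \tag{5.3.1}$$

Next consider $M(t; n)$, where

$$\begin{aligned} M(t; n) &= E\left[ \exp\left( t\, \frac{\sum X_i - n\mu}{\sigma\sqrt{n}} \right) \right] \\ &= E\left[ \exp\left( t\, \frac{X_1 - \mu}{\sigma\sqrt{n}} \right) \cdots \exp\left( t\, \frac{X_n - \mu}{\sigma\sqrt{n}} \right) \right] \\ &= \left\{ E\left[ \exp\left( t\, \frac{X - \mu}{\sigma\sqrt{n}} \right) \right] \right\}^n \\ &= \left[ m\left( \frac{t}{\sigma\sqrt{n}} \right) \right]^n, \quad -h < \frac{t}{\sigma\sqrt{n}} < h. \end{aligned}$$

In equation (5.3.1), replace $t$ by $t/\sigma\sqrt{n}$ to obtain

$$m\left( \frac{t}{\sigma\sqrt{n}} \right) = 1 + \frac{t^2}{2n} + \frac{[m''(\xi) - \sigma^2]t^2}{2n\sigma^2},$$

where now $\xi$ is between 0 and $t/\sigma\sqrt{n}$ with $-h\sigma\sqrt{n} < t < h\sigma\sqrt{n}$. Accordingly,

$$M(t; n) = \left\{ 1 + \frac{t^2}{2n} + \frac{[m''(\xi) - \sigma^2]t^2}{2n\sigma^2} \right\}^n.$$

Since $m''(t)$ is continuous at $t = 0$ and since $\xi \to 0$ as $n \to \infty$, we have

$$\lim_{n \to \infty} [m''(\xi) - \sigma^2] = 0.$$

The limit proposition (5.2.16) cited in Section 5.2 shows that

$$\lim_{n \to \infty} M(t; n) = e^{t^2/2}$$

for all real values of $t$. This proves that the random variable $Y_n = \sqrt{n}(\bar{X}_n - \mu)/\sigma$
has a limiting standard normal distribution. ■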


As cited in Remark 5.2.1, we say that $Y_n$ has a limiting standard normal
distribution. We interpret this theorem as saying that when $n$ is a large, fixed positive
integer, the random variable $\bar{X}$ has an approximate normal distribution with mean
$\mu$ and variance $\sigma^2/n$; and in applications we often use the approximate normal pdf
as though it were the exact pdf of $\bar{X}$. Also, we can equivalently state the conclusion
of the Central Limit Theorem as

$$\sqrt{n}(\bar{X} - \mu) \xrightarrow{D} N(0, \sigma^2). \tag{5.3.2}$$


This is often a convenient formulation to use. 

One of the key applications of the Central Limit Theorem is for statistical infer- 
ence. In Examples 5.3.1—5.3.6, we present results for several such applications. As 
we point out, we made use of these results in Chapter 4, but we will also use them 
in the remainder of the book. 


Example 5.3.1 (Large Sample Inference for $\mu$). Let $X_1, X_2, \ldots, X_n$ be a random
sample from a distribution with mean $\mu$ and variance $\sigma^2$, where $\mu$ and $\sigma^2$
are unknown. Let $\bar{X}$ and $S$ be the sample mean and sample standard deviation,
respectively. Then

$$\frac{\bar{X} - \mu}{S/\sqrt{n}} \xrightarrow{D} N(0, 1). \tag{5.3.3}$$

To see this, write the left side as

$$\frac{\bar{X} - \mu}{S/\sqrt{n}} = \left( \frac{\sigma}{S} \right) \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma}.$$

Example 5.1.1 shows that $S$ converges in probability to $\sigma$ and, hence, by the
theorems of Section 5.2, that $\sigma/S$ converges in probability to 1. Thus the result (5.3.3)
follows from the CLT and Slutsky’s Theorem, Theorem 5.2.5.

In Examples 4.2.2 and 4.5.3 of Chapter 4, we presented large sample confidence
intervals and tests for $\mu$ based on (5.3.3). ■
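Result (5.3.3) can be explored by simulation for a non-normal population. The sketch below assumes an exponential population with mean 1, sample size $n = 200$, and an arbitrary seed; none of these choices come from the text. It estimates how often the studentized mean falls within the central 95% range of a standard normal distribution.

```r
set.seed(6)                                    # arbitrary seed
n <- 200; nsims <- 10000                       # illustrative values only
# Assumed population: exponential with mean mu = 1; one sample per column
x <- matrix(rexp(n * nsims, rate = 1), nrow = n)
tstat <- (colMeans(x) - 1) / (apply(x, 2, sd) / sqrt(n))
cover <- mean(abs(tstat) <= qnorm(0.975))
c(simulated = cover, nominal = 0.95)
```

If (5.3.3) is a good approximation at this sample size, the simulated proportion should be close to 0.95.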


Some illustrative examples, here and below, help show the importance of this 
version of the CLT. 
Example 5.3.2. Let $\overline{X}$ denote the mean of a random sample of size 75 from
the distribution that has the pdf

$$f(x) = \begin{cases} 1 & 0 < x < 1 \\ 0 & \text{elsewhere.} \end{cases}$$

For this situation, it can be shown that the pdf of $\overline{X}$, $g(\overline{x})$,
has a graph when $0 < \overline{x} < 1$ that is composed of arcs of 75 different
polynomials of degree 74. The computation of such a probability as
$P(0.45 < \overline{X} < 0.55)$ would be extremely laborious. The conditions of the
theorem are satisfied, since $M(t)$ exists for all real values of t. Moreover,
$\mu = \frac{1}{2}$ and $\sigma^2 = \frac{1}{12}$, so that using R we have approximately

$$\begin{aligned}
P(0.45 < \overline{X} < 0.55)
 &= P\left[\frac{\sqrt{n}(0.45 - \mu)}{\sigma} < \frac{\sqrt{n}(\overline{X} - \mu)}{\sigma} < \frac{\sqrt{n}(0.55 - \mu)}{\sigma}\right] \\
 &= P[-1.5 < 30(\overline{X} - 0.5) < 1.5] \\
 &\approx \texttt{pnorm(1.5)} - \texttt{pnorm(-1.5)} = 0.8664.
\end{aligned}$$

■
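
The approximation in Example 5.3.2 can be checked against a Monte Carlo estimate. The R sketch below is illustrative only: the seed, the number of replications `nsim`, and the variable names are assumptions made here.

```r
# Normal approximation for the mean of 75 uniform(0,1) observations
n <- 75
mu <- 1/2                 # mean of the uniform(0,1) distribution
sigma <- sqrt(1/12)       # standard deviation of the uniform(0,1) distribution

# CLT approximation to P(0.45 < Xbar < 0.55)
z <- sqrt(n) * (0.55 - mu) / sigma          # equals 1.5
approx_prob <- pnorm(z) - pnorm(-z)

# Monte Carlo estimate of the same probability (nsim is an arbitrary choice)
set.seed(1)
nsim <- 20000
xbar <- rowMeans(matrix(runif(nsim * n), nrow = nsim))
sim_prob <- mean(xbar > 0.45 & xbar < 0.55)

round(c(approx = approx_prob, simulated = sim_prob), 4)
```

The two numbers should agree to within simulation error, which is a few thousandths for this choice of `nsim`.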


Example 5.3.3 (Normal Approximation to the Binomial Distribution). Suppose
that $X_1, X_2, \ldots, X_n$ is a random sample from a distribution that is $b(1,p)$.
Here $\mu = p$, $\sigma^2 = p(1-p)$, and $M(t)$ exists for all real values of t. If
$Y_n = X_1 + \cdots + X_n$, it is known that $Y_n$ is $b(n,p)$. Calculations of
probabilities for $Y_n$, when we do not use the Poisson approximation, are
simplified by making use of the fact that
$(Y_n - np)/\sqrt{np(1-p)} = \sqrt{n}(\overline{X}_n - p)/\sqrt{p(1-p)} = \sqrt{n}(\overline{X}_n - \mu)/\sigma$
has a limiting distribution that is normal with mean zero and variance 1.

Frequently, statisticians say that $Y_n$, or more simply Y, has an approximate
normal distribution with mean $np$ and variance $np(1-p)$. Even with n as small
as 10, with $p = \frac{1}{2}$ so that the binomial distribution is symmetric about
$np = 5$, we note in Figure 5.3.1 how well the normal distribution, $N(5, \frac{5}{2})$,
fits the binomial distribution, $b(10, \frac{1}{2})$, where the heights of the
rectangles represent the probabilities of the respective integers 0, 1, 2, ..., 10.
Note that the area of the rectangle whose base is $(k - 0.5, k + 0.5)$ and the area
under the normal pdf between $k - 0.5$ and $k + 0.5$ are approximately equal for
each k = 0, 1, 2, ..., 10, even with n = 10. This example should help the reader
understand Example 5.3.4.


Example 5.3.4. With the background of Example 5.3.3, let n = 100 and $p = \frac{1}{2}$,
and suppose that we wish to compute P(Y = 48, 49, 50, 51, 52). Since Y is a random
variable of the discrete type, {Y = 48, 49, 50, 51, 52} and {47.5 < Y < 52.5} are
equivalent events. That is, P(Y = 48, 49, 50, 51, 52) = P(47.5 < Y < 52.5). Since
$np = 50$ and $np(1-p) = 25$, the latter probability may be written

$$P(47.5 < Y < 52.5) = P\left(\frac{47.5 - 50}{5} < \frac{Y - 50}{5} < \frac{52.5 - 50}{5}\right)
 = P\left(-0.5 < \frac{Y - 50}{5} < 0.5\right).$$

Figure 5.3.1: The $b(10, \frac{1}{2})$ pmf overlaid by the $N(5, \frac{5}{2})$ pdf.

Since $(Y - 50)/5$ has an approximate normal distribution with mean zero and
variance 1, the probability is approximately pnorm(.5)-pnorm(-.5) = 0.3829.

The convention of selecting the event 47.5 < Y < 52.5, instead of another event, 
say, 47.8 < Y < 52.3, as the event equivalent to the event Y = 48, 49, 50,51, 52 is 
due to the following observation. The probability P(Y = 48, 49,50,51,52) can be 
interpreted as the sum of five rectangular areas where the rectangles have widths 
1 and the heights are respectively P(Y = 48),...,P(Y = 52). If these rectangles 
are so located that the midpoints of their bases are, respectively, at the points 
48,49,...,52 on a horizontal axis, then in approximating the sum of these areas 
by an area bounded by the horizontal axis, the graph of a normal pdf, and two 
ordinates, it seems reasonable to take the two ordinates at the points 47.5 and 52.5. 
This is called the continuity correction. 
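
The effect of the continuity correction in Example 5.3.4 can be seen by comparing the exact binomial probability with the normal approximation computed with and without the half-unit adjustment. This R sketch uses only base functions; the variable names are chosen here for illustration.

```r
# Exact binomial probability versus normal approximations, n = 100, p = 1/2
n <- 100; p <- 1/2
mu <- n * p                      # 50
sigma <- sqrt(n * p * (1 - p))   # 5

exact <- sum(dbinom(48:52, n, p))                        # P(Y = 48,...,52)
with_cc <- pnorm((52.5 - mu) / sigma) - pnorm((47.5 - mu) / sigma)
without_cc <- pnorm((52 - mu) / sigma) - pnorm((48 - mu) / sigma)

round(c(exact = exact, with_cc = with_cc, without_cc = without_cc), 4)
```

The corrected approximation lands much closer to the exact value than the uncorrected one, which ignores half of each end rectangle.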


We next present two examples concerning large sample inference for proportions. 


Example 5.3.5 (Large Sample Inference for Proportions). Let $X_1, X_2, \ldots, X_n$ be
a random sample from a Bernoulli distribution with p as the probability of success.
Let $\hat{p}$ be the sample proportion of successes. Then $\hat{p} = \overline{X}$. Hence,

$$\frac{\hat{p} - p}{\sqrt{\hat{p}(1 - \hat{p})/n}} \xrightarrow{D} N(0,1). \qquad (5.3.4)$$

This is readily established by using the CLT and the same reasoning as in Example
5.3.1; see Exercise 5.3.13.

In Examples 4.2.3 and 4.5.2 of Chapter 4, we presented large sample confidence
intervals and tests for p using (5.3.4). ■


Example 5.3.6 (Large Sample Inference for $\chi^2$-Tests). Another extension of
Example 5.3.3 that was used in Section 4.7 follows quickly from the Central Limit
Theorem and Theorem 5.2.4. Using the notation of Example 5.3.3, suppose $Y_n$
has a binomial distribution with parameters n and p. Then, as in Example 5.3.3,
$(Y_n - np)/\sqrt{np(1-p)}$ converges in distribution to a random variable Z with the
N(0,1) distribution. Hence, by Theorem 5.2.4,

$$\left(\frac{Y_n - np}{\sqrt{np(1-p)}}\right)^2 \xrightarrow{D} \chi^2(1). \qquad (5.3.5)$$

This was the result referenced in Chapter 4; see expression (4.7.1). ■


We know that $\overline{X}$ and $\sum_{i=1}^n X_i$ have approximately normal distributions, provided
that n is large enough. Later, we find that other statistics also have approximate 
normal distributions, and this is the reason that the normal distribution is so impor- 
tant to statisticians. That is, while not many underlying distributions are normal, 
the distributions of statistics calculated from random samples arising from these 
distributions are often very close to being normal. 

Frequently, we are interested in functions of statistics that have approximately
normal distributions. To illustrate, consider the sequence of random variables $Y_n$ of
Example 5.3.3. As discussed there, $Y_n$ has an approximate $N[np, np(1-p)]$
distribution. So $np(1-p)$ is an important function of p, as it is the variance of
$Y_n$. Thus, if p is unknown, we might want to estimate the variance of $Y_n$. Since
$E(Y_n/n) = p$, we might use $n(Y_n/n)(1 - Y_n/n)$ as such an estimator and would want
to know something about the latter's distribution. In particular, does it also have
an approximate normal distribution? If so, what are its mean and variance? To answer
questions like these, we can apply the $\Delta$-method, Theorem 5.2.9.

As an illustration of the $\Delta$-method, we consider a function of the sample mean.
Assume that $X_1, \ldots, X_n$ is a random sample on X which has finite mean $\mu$ and
variance $\sigma^2$. Then rewriting expression (5.3.2) we have by the Central Limit
Theorem that

$$\sqrt{n}(\overline{X} - \mu) \xrightarrow{D} N(0, \sigma^2).$$

Hence, by the $\Delta$-method, Theorem 5.2.9, we have

$$\sqrt{n}[g(\overline{X}) - g(\mu)] \xrightarrow{D} N(0, \sigma^2 (g'(\mu))^2), \qquad (5.3.6)$$

for a continuous transformation $g(x)$ such that $g'(\mu) \neq 0$.
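
As a worked illustration of (5.3.6), offered only as a sketch, consider the question raised earlier about estimating $p(1-p)$ in the $b(1,p)$ setting of Example 5.3.3. The choice $g(p) = p(1-p)$ is made here for illustration. Then $\mu = p$, $\sigma^2 = p(1-p)$, and $g'(p) = 1 - 2p$, so for $p \neq \frac{1}{2}$ expression (5.3.6) gives

```latex
\sqrt{n}\left[\overline{X}(1-\overline{X}) - p(1-p)\right]
  \xrightarrow{D} N\!\left(0,\; p(1-p)(1-2p)^2\right),
  \qquad p \neq \tfrac{1}{2}.
```

Multiplying through by n, the estimator $n(Y_n/n)(1 - Y_n/n)$ of the variance $np(1-p)$ is therefore approximately normal with mean $np(1-p)$ and variance $n\,p(1-p)(1-2p)^2$ when $p \neq \frac{1}{2}$. At $p = \frac{1}{2}$ the derivative vanishes and (5.3.6) does not apply.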


Example 5.3.7. Assume that we are sampling from a binomial $b(1,p)$ distribution.
Then $\overline{X}$ is the sample proportion of successes. Here $\mu = p$ and
$\sigma^2 = p(1-p)$. Suppose that we want a transformation $g(p)$ such that the
transformed asymptotic variance is constant; in particular, it is free of p. Hence,
we seek a transformation $g(p)$ such that

$$g'(p) = \frac{c}{\sqrt{p(1-p)}}$$

for some constant c. Integrating both sides and making the change-of-variables
$z = \sqrt{p}$, $dz = 1/(2\sqrt{p})\,dp$, we have

$$g(p) = c\int \frac{1}{\sqrt{p(1-p)}}\,dp
      = 2c\int \frac{1}{\sqrt{1 - z^2}}\,dz
      = 2c\arcsin(z) = 2c\arcsin(\sqrt{p}).$$

Taking $c = 1/2$, for the statistic $g(\overline{X}) = \arcsin(\sqrt{\overline{X}})$, we obtain

$$\sqrt{n}\left[\arcsin\left(\sqrt{\overline{X}}\right) - \arcsin\left(\sqrt{p}\right)\right] \xrightarrow{D} N\left(0, \tfrac{1}{4}\right).$$

Several other such examples are given in the exercises. ■
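
The variance-stabilizing property of the arcsine transformation in Example 5.3.7 can be examined numerically. The R sketch below simulates sample proportions for several values of p and compares the variance of the transformed statistic with that of the untransformed one. The sample size `n`, the replication count `nsim`, the grid `ps`, and the seed are assumptions chosen here.

```r
# Variance of sqrt(n) * (arcsin(sqrt(Xbar)) - arcsin(sqrt(p))) for several p
set.seed(2)
n <- 400          # sample size (arbitrary)
nsim <- 20000     # number of simulated samples (arbitrary)
ps <- c(0.2, 0.5, 0.8)

stab_var <- sapply(ps, function(p) {
  phat <- rbinom(nsim, n, p) / n                    # simulated sample proportions
  var(sqrt(n) * (asin(sqrt(phat)) - asin(sqrt(p))))
})
raw_var <- sapply(ps, function(p) {
  phat <- rbinom(nsim, n, p) / n
  var(sqrt(n) * (phat - p))                         # approximately p(1 - p)
})

round(rbind(p = ps, arcsin = stab_var, untransformed = raw_var), 3)
```

The `arcsin` row should stay near 1/4 for every p, while the `untransformed` row follows $p(1-p)$ and so changes with p.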


EXERCISES 


5.3.1. Let $\overline{X}$ denote the mean of a random sample of size 100 from a distribution
that is $\chi^2(50)$. Compute an approximate value of $P(49 < \overline{X} < 51)$.


5.3.2. Let $\overline{X}$ denote the mean of a random sample of size 128 from a gamma
distribution with $\alpha = 2$ and $\beta = 4$. Approximate $P(7 < \overline{X} < 9)$.


5.3.3. Let Y be $b(72, \frac{1}{3})$. Approximate P(22 < Y < 28).


5.3.4. Compute an approximate probability that the mean of a random sample of
size 15 from a distribution having pdf $f(x) = 3x^2$, 0 < x < 1, zero elsewhere, is
between $\frac{3}{5}$ and $\frac{4}{5}$.


5.3.5. Let Y denote the sum of the observations of a random sample of size 12 from
a distribution having pmf $p(x) = \frac{1}{6}$, x = 1, 2, 3, 4, 5, 6, zero elsewhere. Compute an
approximate value of $P(36 \le Y \le 48)$.

Hint: Since the event of interest is Y = 36, 37, ..., 48, rewrite the probability as
P(35.5 < Y < 48.5).


5.3.6. Let Y be $b(400, \frac{1}{5})$. Compute an approximate value of P(0.25 < Y/400).

5.3.7. If Y is $b(100, \frac{1}{2})$, approximate the value of P(Y = 50).


5.3.8. Let Y be b(n, 0.55). Find the smallest value of n such that (approximately)
$P(Y/n > \frac{1}{2}) \ge 0.95$.

5.3.9. Let $f(x) = 1/x^2$, $1 < x < \infty$, zero elsewhere, be the pdf of a random
variable X. Consider a random sample of size 72 from the distribution having this
pdf. Compute approximately the probability that more than 50 of the observations
of the random sample are less than 3.


5.3.10. Forty-eight measurements are recorded to several decimal places. Each of 
these 48 numbers is rounded off to the nearest integer. The sum of the original 48 
numbers is approximated by the sum of these integers. If we assume that the errors 
made by rounding off are iid and have a uniform distribution over the interval 
$(-\frac{1}{2}, \frac{1}{2})$, compute approximately the probability that the sum of the integers is
within two units of the true sum. 


5.3.11. We know that $\overline{X}$ is approximately $N(\mu, \sigma^2/n)$ for large n. Find the
approximate distribution of $u(\overline{X}) = \overline{X}^3$, provided that $\mu \neq 0$.


5.3.12. Let $X_1, X_2, \ldots, X_n$ be a random sample from a Poisson distribution with
mean $\mu$. Thus, $Y = \sum_{i=1}^n X_i$ has a Poisson distribution with mean $n\mu$. Moreover,
$\overline{X} = Y/n$ is approximately $N(\mu, \mu/n)$ for large n. Show that $u(Y/n) = \sqrt{Y/n}$ is a
function of Y/n whose variance is essentially free of $\mu$.


5.3.13. Using the notation of Example 5.3.5, show that equation (5.3.4) is true. 


5.3.14. Assume that $X_1, \ldots, X_n$ is a random sample from a $\Gamma(1, \beta)$ distribution.
Determine the asymptotic distribution of $\sqrt{n}(\overline{X} - \beta)$. Then find a transformation
$g(\overline{X})$ whose asymptotic variance is free of $\beta$.


5.4 *Extensions to Multivariate Distributions 


In this section, we briefly discuss asymptotic concepts for sequences of random 
vectors. The concepts introduced for univariate random variables generalize in a 
straightforward manner to the multivariate case. Our development is brief, and 
the interested reader can consult more advanced texts for more depth; see Serfling 
(1980). 

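We need some notation. For a vector $\mathbf{v} \in R^p$, recall the Euclidean norm of $\mathbf{v}$ is
defined to be

$$\|\mathbf{v}\| = \sqrt{\sum_{i=1}^p v_i^2}. \qquad (5.4.1)$$

This norm satisfies the usual three properties given by

(a) For all $\mathbf{v} \in R^p$, $\|\mathbf{v}\| \ge 0$, and $\|\mathbf{v}\| = 0$ if and only if $\mathbf{v} = \mathbf{0}$.
(b) For all $\mathbf{v} \in R^p$ and $a \in R$, $\|a\mathbf{v}\| = |a|\,\|\mathbf{v}\|$.   (5.4.2)
(c) For all $\mathbf{v}, \mathbf{u} \in R^p$, $\|\mathbf{u} + \mathbf{v}\| \le \|\mathbf{u}\| + \|\mathbf{v}\|$.


Denote the standard basis of $R^p$ by the vectors $\mathbf{e}_1, \ldots, \mathbf{e}_p$, where all the components
of $\mathbf{e}_i$ are 0 except for the ith component, which is 1. Then we can write any vector
$\mathbf{v}' = (v_1, \ldots, v_p)$ as

$$\mathbf{v} = \sum_{i=1}^p v_i \mathbf{e}_i.$$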


The following lemma will be useful: 


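Lemma 5.4.1. Let $\mathbf{v}' = (v_1, \ldots, v_p)$ be any vector in $R^p$. Then

$$|v_j| \le \|\mathbf{v}\| \le \sum_{i=1}^p |v_i|, \quad \text{for all } j = 1, \ldots, p. \qquad (5.4.3)$$

Proof: Note that for all j,

$$v_j^2 \le \sum_{i=1}^p v_i^2 = \|\mathbf{v}\|^2;$$

hence, taking the square root of this inequality leads to the first part of the desired
inequality. The second part is

$$\|\mathbf{v}\| = \left\|\sum_{i=1}^p v_i \mathbf{e}_i\right\| \le \sum_{i=1}^p |v_i|\,\|\mathbf{e}_i\| = \sum_{i=1}^p |v_i|.$$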


Let $\{\mathbf{X}_n\}$ denote a sequence of p-dimensional vectors. Because the absolute
value is the Euclidean norm in $R^1$, the definition of convergence in probability for
random vectors is an immediate generalization:


Definition 5.4.1. Let $\{\mathbf{X}_n\}$ be a sequence of p-dimensional vectors and let $\mathbf{X}$ be a
random vector, all defined on the same sample space. We say that $\{\mathbf{X}_n\}$ converges
in probability to $\mathbf{X}$ if

$$\lim_{n\to\infty} P[\|\mathbf{X}_n - \mathbf{X}\| \ge \epsilon] = 0, \qquad (5.4.4)$$

for all $\epsilon > 0$. As in the univariate case, we write $\mathbf{X}_n \xrightarrow{P} \mathbf{X}$.


As the next theorem shows, convergence in probability of vectors is equivalent 
to componentwise convergence in probability. 


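Theorem 5.4.1. Let $\{\mathbf{X}_n\}$ be a sequence of p-dimensional vectors and let $\mathbf{X}$ be a
random vector, all defined on the same sample space. Then

$$\mathbf{X}_n \xrightarrow{P} \mathbf{X} \ \text{ if and only if } \ X_{nj} \xrightarrow{P} X_j \ \text{ for all } j = 1, \ldots, p.$$

Proof: This follows immediately from Lemma 5.4.1. Suppose $\mathbf{X}_n \xrightarrow{P} \mathbf{X}$. For any j,
from the first part of the inequality (5.4.3), we have, for $\epsilon > 0$,

$$\epsilon \le |X_{nj} - X_j| \le \|\mathbf{X}_n - \mathbf{X}\|.$$

Hence

$$\overline{\lim}_{n\to\infty} P[|X_{nj} - X_j| \ge \epsilon] \le \overline{\lim}_{n\to\infty} P[\|\mathbf{X}_n - \mathbf{X}\| \ge \epsilon] = 0,$$

which is the desired result.
Conversely, if $X_{nj} \xrightarrow{P} X_j$ for all j = 1, ..., p, then by the second part of the
inequality (5.4.3),

$$\epsilon \le \|\mathbf{X}_n - \mathbf{X}\| \le \sum_{j=1}^p |X_{nj} - X_j|,$$

for any $\epsilon > 0$. Hence

$$\overline{\lim}_{n\to\infty} P[\|\mathbf{X}_n - \mathbf{X}\| \ge \epsilon]
 \le \overline{\lim}_{n\to\infty} P\left[\sum_{j=1}^p |X_{nj} - X_j| \ge \epsilon\right]
 \le \sum_{j=1}^p \overline{\lim}_{n\to\infty} P[|X_{nj} - X_j| \ge \epsilon/p] = 0.$$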


Based on this result, many of the theorems involving convergence in probability
can easily be extended to the multivariate setting. Some of these results are given
in the exercises. This is true of statistical results, too. For example, in Section
5.2, we showed that if $X_1, \ldots, X_n$ is a random sample from the distribution of a
random variable X with mean, $\mu$, and variance, $\sigma^2$, then $\overline{X}_n$ and $S_n^2$ are consistent
estimates of $\mu$ and $\sigma^2$. By the last theorem, we have that $(\overline{X}_n, S_n^2)$ is a consistent
estimate of $(\mu, \sigma^2)$.

As another simple application, consider the multivariate analog of the sample
mean and sample variance. Let $\{\mathbf{X}_n\}$ be a sequence of iid random vectors with
common mean vector $\boldsymbol{\mu}$ and variance-covariance matrix $\boldsymbol{\Sigma}$. Denote the vector of
means by

$$\overline{\mathbf{X}}_n = \frac{1}{n}\sum_{i=1}^n \mathbf{X}_i. \qquad (5.4.5)$$

Of course, $\overline{\mathbf{X}}_n$ is just the vector of sample means, $(\overline{X}_1, \ldots, \overline{X}_p)'$. By the Weak Law
of Large Numbers, Theorem 5.1.1, $\overline{X}_j \to \mu_j$, in probability, for each j. Hence, by
Theorem 5.4.1, $\overline{\mathbf{X}}_n \to \boldsymbol{\mu}$, in probability.

How about the analog of the sample variances? Let $\mathbf{X}_i = (X_{i1}, \ldots, X_{ip})'$. Define
the sample variances and covariances by

$$S_{n,j}^2 = \frac{1}{n-1}\sum_{i=1}^n (X_{ij} - \overline{X}_j)^2, \quad \text{for } j = 1, \ldots, p, \qquad (5.4.6)$$

$$S_{n,jk} = \frac{1}{n-1}\sum_{i=1}^n (X_{ij} - \overline{X}_j)(X_{ik} - \overline{X}_k), \quad \text{for } j \neq k = 1, \ldots, p. \qquad (5.4.7)$$

Assuming finite fourth moments, the Weak Law of Large Numbers shows that all
these componentwise sample variances and sample covariances converge in
probability to distribution variances and covariances, respectively. As in our discussion
after the Weak Law of Large Numbers, the Strong Law of Large Numbers implies
that this convergence is true under the weaker assumption of the existence of finite
second moments. If we define the p x p matrix $\mathbf{S}$ to be the matrix with the jth
diagonal entry $S_{n,j}^2$ and (j,k)th entry $S_{n,jk}$, then $\mathbf{S} \to \boldsymbol{\Sigma}$, in probability.
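
The componentwise convergence described above can be observed numerically. The R sketch below draws iid bivariate vectors and reports how far the sample mean vector and the sample covariance matrix are from their targets as n grows. The bivariate normal model and the particular `mu` and `Sigma` are assumptions made here for convenience; the consistency results themselves do not require normality.

```r
# Sample mean vector and sample covariance matrix for iid bivariate vectors
set.seed(3)
mu <- c(1, -2)
Sigma <- matrix(c(2, 0.6, 0.6, 1), nrow = 2)
R <- chol(Sigma)                 # upper triangular, t(R) %*% R equals Sigma

sim_xy <- function(n) {
  z <- matrix(rnorm(2 * n), ncol = 2)      # rows are iid N_2(0, I)
  sweep(z %*% R, 2, mu, "+")               # rows are iid N_2(mu, Sigma)
}

for (n in c(100, 10000)) {
  x <- sim_xy(n)
  cat("n =", n, " max |xbar - mu| =", max(abs(colMeans(x) - mu)),
      " max |S - Sigma| =", max(abs(cov(x) - Sigma)), "\n")
}
```

Here `colMeans(x)` plays the role of the vector in (5.4.5) and `cov(x)` computes the matrix of sample variances and covariances with divisor n - 1.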

The definition of convergence in distribution remains the same. We state it here 
in terms of vector notation. 


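Definition 5.4.2. Let $\{\mathbf{X}_n\}$ be a sequence of random vectors with $\mathbf{X}_n$ having
distribution function $F_n(\mathbf{x})$ and $\mathbf{X}$ be a random vector with distribution function $F(\mathbf{x})$.
Then $\{\mathbf{X}_n\}$ converges in distribution to $\mathbf{X}$ if

$$\lim_{n\to\infty} F_n(\mathbf{x}) = F(\mathbf{x}), \qquad (5.4.8)$$

for all points $\mathbf{x}$ at which $F(\mathbf{x})$ is continuous. We write $\mathbf{X}_n \xrightarrow{D} \mathbf{X}$.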


In the multivariate case, there are analogs to many of the theorems in Section 
5.2. We state two important theorems without proof. 


Theorem 5.4.2. Let $\{\mathbf{X}_n\}$ be a sequence of random vectors that converges in
distribution to a random vector $\mathbf{X}$ and let $g(\mathbf{x})$ be a function that is continuous on the
support of $\mathbf{X}$. Then $g(\mathbf{X}_n)$ converges in distribution to $g(\mathbf{X})$.


We can apply this theorem to show that convergence in distribution implies
marginal convergence. Simply take $g(\mathbf{x}) = x_j$, where $\mathbf{x} = (x_1, \ldots, x_p)'$. Since g is
continuous, the desired result follows.

It is often difficult to determine convergence in distribution by using the defini- 
tion. As in the univariate case, convergence in distribution is equivalent to conver- 
gence of moment generating functions, which we state in the following theorem. 


Theorem 5.4.3. Let $\{\mathbf{X}_n\}$ be a sequence of random vectors with $\mathbf{X}_n$ having
distribution function $F_n(\mathbf{x})$ and moment generating function $M_n(\mathbf{t})$. Let $\mathbf{X}$ be a random
vector with distribution function $F(\mathbf{x})$ and moment generating function $M(\mathbf{t})$. Then
$\{\mathbf{X}_n\}$ converges in distribution to $\mathbf{X}$ if and only if, for some h > 0,

$$\lim_{n\to\infty} M_n(\mathbf{t}) = M(\mathbf{t}), \qquad (5.4.9)$$

for all $\mathbf{t}$ such that $\|\mathbf{t}\| < h$.


The proof of this theorem can be found in more advanced books; see, for instance, 
Tucker (1967). Also, the usual proof is for characteristic functions instead of moment 
generating functions. As we mentioned previously, characteristic functions always 
exist, so convergence in distribution is completely characterized by convergence of 
corresponding characteristic functions. 

The moment generating function of $\mathbf{X}_n$ is $E[\exp\{\mathbf{t}'\mathbf{X}_n\}]$. Note that $\mathbf{t}'\mathbf{X}_n$ is a
random variable. We can frequently use this and univariate theory to derive results
in the multivariate case. A perfect example of this is the multivariate central limit
theorem.


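Theorem 5.4.4 (Multivariate Central Limit Theorem). Let $\{\mathbf{X}_n\}$ be a sequence
of iid random vectors with common mean vector $\boldsymbol{\mu}$ and variance-covariance matrix
$\boldsymbol{\Sigma}$ which is positive definite. Assume that the common moment generating function
$M(\mathbf{t})$ exists in an open neighborhood of $\mathbf{0}$. Let

$$\mathbf{Y}_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n (\mathbf{X}_i - \boldsymbol{\mu}) = \sqrt{n}(\overline{\mathbf{X}} - \boldsymbol{\mu}).$$

Then $\mathbf{Y}_n$ converges in distribution to a $N_p(\mathbf{0}, \boldsymbol{\Sigma})$ distribution.


Proof: Let $\mathbf{t} \in R^p$ be a vector in the stipulated neighborhood of $\mathbf{0}$. The moment
generating function of $\mathbf{Y}_n$ is

$$\begin{aligned}
M_n(\mathbf{t}) &= E\left[\exp\left\{\mathbf{t}'\frac{1}{\sqrt{n}}\sum_{i=1}^n (\mathbf{X}_i - \boldsymbol{\mu})\right\}\right] \\
 &= E\left[\exp\left\{\frac{1}{\sqrt{n}}\sum_{i=1}^n \mathbf{t}'(\mathbf{X}_i - \boldsymbol{\mu})\right\}\right] \\
 &= E\left[\exp\left\{\frac{1}{\sqrt{n}}\sum_{i=1}^n W_i\right\}\right], \qquad (5.4.10)
\end{aligned}$$

where $W_i = \mathbf{t}'(\mathbf{X}_i - \boldsymbol{\mu})$. Note that $W_1, \ldots, W_n$ are iid with mean 0 and variance
$\mathrm{Var}(W_i) = \mathbf{t}'\boldsymbol{\Sigma}\mathbf{t}$. Hence, by the simple Central Limit Theorem,

$$\frac{1}{\sqrt{n}}\sum_{i=1}^n W_i \xrightarrow{D} N(0, \mathbf{t}'\boldsymbol{\Sigma}\mathbf{t}). \qquad (5.4.11)$$

Expression (5.4.10), though, is the mgf of $(1/\sqrt{n})\sum_{i=1}^n W_i$ evaluated at 1.
Therefore, by (5.4.11), we must have

$$M_n(\mathbf{t}) = E\left[\exp\left\{(1)\frac{1}{\sqrt{n}}\sum_{i=1}^n W_i\right\}\right] \to e^{1^2\,\mathbf{t}'\boldsymbol{\Sigma}\mathbf{t}/2} = e^{\mathbf{t}'\boldsymbol{\Sigma}\mathbf{t}/2}.$$

Because the last quantity is the moment generating function of a $N_p(\mathbf{0}, \boldsymbol{\Sigma})$
distribution, we have the desired result. ■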


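Suppose $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ is a random sample from a distribution with mean
vector $\boldsymbol{\mu}$ and variance-covariance matrix $\boldsymbol{\Sigma}$. Let $\overline{\mathbf{X}}_n$ be the vector of sample means.
Then, from the Central Limit Theorem, we say that

$$\overline{\mathbf{X}}_n \text{ has an approximate } N_p\left(\boldsymbol{\mu}, \tfrac{1}{n}\boldsymbol{\Sigma}\right) \text{ distribution.} \qquad (5.4.12)$$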


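A result that we use frequently concerns linear transformations. Its proof is
obtained by using moment generating functions and is left as an exercise.

Theorem 5.4.5. Let $\{\mathbf{X}_n\}$ be a sequence of p-dimensional random vectors. Suppose
$\mathbf{X}_n \xrightarrow{D} N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Let $\mathbf{A}$ be an m x p matrix of constants and let $\mathbf{b}$ be an
m-dimensional vector of constants. Then $\mathbf{A}\mathbf{X}_n + \mathbf{b} \xrightarrow{D} N_m(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}')$.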
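
A result that will prove to be quite useful is the extension of the $\Delta$-method; see
Theorem 5.2.9. A proof can be found in Chapter 3 of Serfling (1980).


Theorem 5.4.6. Let $\{\mathbf{X}_n\}$ be a sequence of p-dimensional random vectors. Suppose

$$\sqrt{n}(\mathbf{X}_n - \boldsymbol{\mu}_0) \xrightarrow{D} N_p(\mathbf{0}, \boldsymbol{\Sigma}).$$

Let $\mathbf{g}$ be a transformation $\mathbf{g}(\mathbf{x}) = (g_1(\mathbf{x}), \ldots, g_k(\mathbf{x}))'$ such that $1 \le k \le p$ and the
k x p matrix of partial derivatives,

$$\mathbf{B} = \left[\frac{\partial g_i}{\partial \mu_j}\right], \quad i = 1, \ldots, k; \ j = 1, \ldots, p,$$

are continuous and do not vanish in a neighborhood of $\boldsymbol{\mu}_0$. Let $\mathbf{B}_0 = \mathbf{B}$ at $\boldsymbol{\mu}_0$. Then

$$\sqrt{n}(\mathbf{g}(\mathbf{X}_n) - \mathbf{g}(\boldsymbol{\mu}_0)) \xrightarrow{D} N_k(\mathbf{0}, \mathbf{B}_0\boldsymbol{\Sigma}\mathbf{B}_0'). \qquad (5.4.13)$$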
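
To make (5.4.13) concrete, the following R sketch works through one case: the ratio $g(x_1, x_2) = x_1/x_2$, so that k = 1 and p = 2. The function g, the values of `mu0` and `Sigma`, and the use of exactly normal sample means in the simulation are all assumptions chosen here for illustration, not taken from the text.

```r
# Multivariate Delta-method for g(x1, x2) = x1 / x2 (k = 1, p = 2)
set.seed(4)
mu0 <- c(2, 4)
Sigma <- matrix(c(1, 0.3, 0.3, 2), nrow = 2)

# Row vector of partial derivatives of g evaluated at mu0
B0 <- matrix(c(1 / mu0[2], -mu0[1] / mu0[2]^2), nrow = 1)
asy_var <- drop(B0 %*% Sigma %*% t(B0))     # B0 Sigma B0'

# Simulation: for normal samples the mean vector is exactly N_2(mu0, Sigma / n)
n <- 2000; nsim <- 20000
z <- matrix(rnorm(2 * nsim), ncol = 2) %*% chol(Sigma / n)
xbar <- sweep(z, 2, mu0, "+")
stat <- sqrt(n) * (xbar[, 1] / xbar[, 2] - mu0[1] / mu0[2])

c(delta_method = asy_var, simulated = var(stat))
```

The simulated variance of $\sqrt{n}(g(\overline{\mathbf{X}}_n) - g(\boldsymbol{\mu}_0))$ should be close to the scalar $\mathbf{B}_0\boldsymbol{\Sigma}\mathbf{B}_0'$ computed from the partial derivatives.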


EXERCISES 

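5.4.1. Let $\{\mathbf{X}_n\}$ be a sequence of p-dimensional random vectors. Show that

$$\mathbf{X}_n \xrightarrow{D} N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \ \text{ if and only if } \ \mathbf{a}'\mathbf{X}_n \xrightarrow{D} N_1(\mathbf{a}'\boldsymbol{\mu}, \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}),$$

for all vectors $\mathbf{a} \in R^p$.


5.4.2. Let $X_1, \ldots, X_n$ be a random sample from a uniform(a, b) distribution. Let
$Y_1 = \min X_i$ and let $Y_2 = \max X_i$. Show that $(Y_1, Y_2)'$ converges in probability to
the vector $(a, b)'$.

5.4.3. Let $\mathbf{X}_n$ and $\mathbf{Y}_n$ be p-dimensional random vectors. Show that if

$$\mathbf{X}_n - \mathbf{Y}_n \xrightarrow{P} \mathbf{0} \ \text{ and } \ \mathbf{X}_n \xrightarrow{D} \mathbf{X},$$

where $\mathbf{X}$ is a p-dimensional random vector, then $\mathbf{Y}_n \xrightarrow{D} \mathbf{X}$.


5.4.4. Let $\mathbf{X}_n$ and $\mathbf{Y}_n$ be p-dimensional random vectors such that $\mathbf{X}_n$ and $\mathbf{Y}_n$ are
independent for each n and their mgfs exist. Show that if

$$\mathbf{X}_n \xrightarrow{D} \mathbf{X} \ \text{ and } \ \mathbf{Y}_n \xrightarrow{D} \mathbf{Y},$$

where $\mathbf{X}$ and $\mathbf{Y}$ are p-dimensional random vectors, then $(\mathbf{X}_n, \mathbf{Y}_n) \xrightarrow{D} (\mathbf{X}, \mathbf{Y})$.


5.4.5. Suppose $\mathbf{X}_n$ has a $N_p(\boldsymbol{\mu}_n, \boldsymbol{\Sigma}_n)$ distribution. Show that

$$\mathbf{X}_n \xrightarrow{D} N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \ \text{ iff } \ \boldsymbol{\mu}_n \to \boldsymbol{\mu} \ \text{ and } \ \boldsymbol{\Sigma}_n \to \boldsymbol{\Sigma}.$$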




Chapter 6 


Maximum Likelihood 
Methods 


6.1 Maximum Likelihood Estimation 


Recall in Chapter 4 that as a point estimation procedure, we introduced maximum 
likelihood estimates (mle). In this chapter, we continue this development showing 
that these likelihood procedures give rise to a formal theory of statistical inference 
(confidence and testing procedures). Under certain conditions (regularity condi- 
tions), these procedures are asymptotically optimal. 

As in Section 4.1, consider a random variable X whose pdf $f(x;\theta)$ depends on
an unknown parameter $\theta$ which is in a set $\Omega$. Our general discussion is for the
continuous case, but the results extend to the discrete case also. For information,
suppose that we have a random sample $X_1, \ldots, X_n$ on X; i.e., $X_1, \ldots, X_n$ are iid
random variables with common pdf $f(x;\theta)$, $\theta \in \Omega$. For now, we assume that $\theta$
is a scalar, but we do extend the results to vectors in Sections 6.4 and 6.5. The
parameter $\theta$ is unknown. The basis of our inferential procedures is the likelihood
function given by

$$L(\theta; \mathbf{x}) = \prod_{i=1}^n f(x_i; \theta), \quad \theta \in \Omega, \qquad (6.1.1)$$

where $\mathbf{x} = (x_1, \ldots, x_n)'$. Because we treat L as a function of $\theta$ in this chapter, we
have transposed the $x_i$ and $\theta$ in the argument of the likelihood function. In fact, we
often write it as $L(\theta)$. Actually, the log of this function is usually more convenient
to use and we denote it by

$$l(\theta) = \log L(\theta) = \sum_{i=1}^n \log f(x_i; \theta), \quad \theta \in \Omega. \qquad (6.1.2)$$

Note that there is no loss of information in using $l(\theta)$ because the log is a one-to-one
function. Most of our discussion in this chapter remains the same if X is a random
vector.

As in Chapter 4, our point estimator of $\theta$ is $\hat{\theta} = \hat{\theta}(X_1, \ldots, X_n)$, where $\hat{\theta}$
maximizes the function $L(\theta)$. We call $\hat{\theta}$ the maximum likelihood estimator (mle) of $\theta$.
In Section 4.1, several motivating examples were given, including the binomial and
normal probability models. Later we give several more examples, but first we offer
a theoretical justification for considering the mle. Let $\theta_0$ denote the true value of $\theta$.
Theorem 6.1.1 shows that the maximum of $L(\theta)$ asymptotically separates the true
model at $\theta_0$ from models at $\theta \neq \theta_0$. To prove this theorem, certain assumptions,
regularity conditions, are required.


Assumptions 6.1.1 (Regularity Conditions). Regularity conditions (R0) through (R2) are

(R0) The cdfs are distinct; i.e., $\theta \neq \theta' \Rightarrow F(x_i;\theta) \neq F(x_i;\theta')$.

(R1) The pdfs have common support for all $\theta$.

(R2) The point $\theta_0$ is an interior point in $\Omega$.


The first assumption states that the parameter identifies the pdf. The second
assumption implies that the support of $X_i$ does not depend on $\theta$. This is restrictive,
and some examples and exercises cover models in which (R1) is not true.


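Theorem 6.1.1. Assume that $\theta_0$ is the true parameter and that
$E_{\theta_0}[f(X_i;\theta)/f(X_i;\theta_0)]$ exists. Under assumptions (R0) and (R1),

$$\lim_{n\to\infty} P_{\theta_0}[L(\theta_0, \mathbf{X}) > L(\theta, \mathbf{X})] = 1, \quad \text{for all } \theta \neq \theta_0. \qquad (6.1.3)$$

Proof: By taking logs, the inequality $L(\theta_0, \mathbf{X}) > L(\theta, \mathbf{X})$ is equivalent to

$$\frac{1}{n}\sum_{i=1}^n \log\left[\frac{f(X_i;\theta)}{f(X_i;\theta_0)}\right] < 0.$$

Since the summands are iid with finite expectation and the function $\phi(x) = -\log(x)$
is strictly convex, it follows from the Law of Large Numbers (Theorem 5.1.1) and
Jensen's inequality (Theorem 1.10.5) that, when $\theta_0$ is the true parameter,

$$\frac{1}{n}\sum_{i=1}^n \log\left[\frac{f(X_i;\theta)}{f(X_i;\theta_0)}\right]
 \xrightarrow{P} E_{\theta_0}\left[\log\frac{f(X_1;\theta)}{f(X_1;\theta_0)}\right]
 < \log E_{\theta_0}\left[\frac{f(X_1;\theta)}{f(X_1;\theta_0)}\right].$$

But

$$E_{\theta_0}\left[\frac{f(X_1;\theta)}{f(X_1;\theta_0)}\right] = \int \frac{f(x;\theta)}{f(x;\theta_0)}\, f(x;\theta_0)\,dx = 1.$$

Because log 1 = 0, the theorem follows. Note that common support is needed to
obtain the last equalities. ■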
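
The conclusion (6.1.3) can be watched taking hold as n increases. The R sketch below estimates $P_{\theta_0}[L(\theta_0, \mathbf{X}) > L(\theta, \mathbf{X})]$ by simulation for a $N(\theta, 1)$ model. The model, the values `theta0 = 0` and `theta = 0.5`, the sample sizes, and the seed are assumptions made here for illustration. For this particular model the event reduces to $\overline{X} < \theta/2$, so the probability is $\Phi(\sqrt{n}\,\theta/2)$, which the last row reports for comparison.

```r
# Estimate P[L(theta0; X) > L(theta; X)] under the N(theta, 1) model
set.seed(5)
theta0 <- 0        # true parameter (assumed for this illustration)
theta  <- 0.5      # competing parameter value (assumed)
nsim   <- 5000     # number of simulated samples per n (assumed)

prob_true_wins <- function(n) {
  wins <- replicate(nsim, {
    x <- rnorm(n, mean = theta0)
    # compare log-likelihoods at theta0 and at theta
    sum(dnorm(x, theta0, log = TRUE)) > sum(dnorm(x, theta, log = TRUE))
  })
  mean(wins)
}

ns <- c(5, 25, 100)
probs <- sapply(ns, prob_true_wins)
rbind(n = ns, estimate = probs, exact = pnorm(sqrt(ns) * theta / 2))
```

The estimated probabilities should increase toward 1 with n, in line with the theorem.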


Theorem 6.1.1 says that asymptotically the likelihood function is maximized at
the true value $\theta_0$. So in considering estimates of $\theta_0$, it seems natural to consider the
value of $\theta$ that maximizes the likelihood.


Definition 6.1.1 (Maximum Likelihood Estimator). We say that $\hat{\theta} = \hat{\theta}(\mathbf{X})$ is a
maximum likelihood estimator (mle) of $\theta$ if

$$\hat{\theta} = \mathrm{Argmax}\, L(\theta; \mathbf{X}). \qquad (6.1.4)$$

The notation Argmax means that $L(\theta; \mathbf{X})$ achieves its maximum value at $\hat{\theta}$.


As in Chapter 4, to determine the mle, we often take the log of the likelihood
and determine its critical value; that is, letting $l(\theta) = \log L(\theta)$, the mle solves the
equation

$$\frac{\partial l(\theta)}{\partial \theta} = 0. \qquad (6.1.5)$$

This is an example of an estimating equation, which we often label as an EE.
This is the first of several EEs in the text.


Example 6.1.1 (Laplace Distribution). Let $X_1, \ldots, X_n$ be iid with density

$$f(x;\theta) = \frac{1}{2} e^{-|x - \theta|}, \quad -\infty < x < \infty, \ -\infty < \theta < \infty. \qquad (6.1.6)$$

This pdf is referred to as either the Laplace or the double exponential distribution.
The log of the likelihood simplifies to

$$l(\theta) = -n\log 2 - \sum_{i=1}^n |x_i - \theta|.$$

The first partial derivative is

$$l'(\theta) = \sum_{i=1}^n \mathrm{sgn}(x_i - \theta), \qquad (6.1.7)$$

where sgn(t) = 1, 0, or -1 depending on whether t > 0, t = 0, or t < 0. Note that
we have used $\frac{d}{dt}|t| = \mathrm{sgn}(t)$, which is true unless t = 0. Setting equation (6.1.7) to 0,
the solution for $\theta$ is $\mathrm{med}\{x_1, x_2, \ldots, x_n\}$, because the median makes half the terms
of the sum in expression (6.1.7) nonpositive and half nonnegative. Recall that we
defined the sample median in expression (4.4.4) and that we denote it by $Q_2$ (the
second quartile of the sample). Hence, $\hat{\theta} = Q_2$ is the mle of $\theta$ for the Laplace pdf
(6.1.6). ■
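
The claim of Example 6.1.1 can be checked numerically by maximizing the Laplace log-likelihood directly and comparing the result with the sample median. In the R sketch below the data are simulated; the true value `theta_true`, the sample size, and the seed are assumptions made here. Laplace variates are generated using the fact that the difference of two iid exponential(1) variates has the standard Laplace distribution.

```r
# Laplace log-likelihood: the maximizer is the sample median
set.seed(6)
theta_true <- 3                            # assumed true location
n <- 101                                   # odd n gives a unique sample median
# A Laplace variate is theta plus the difference of two iid exponential(1) variates
x <- theta_true + rexp(n) - rexp(n)

loglik <- function(theta) -n * log(2) - sum(abs(x - theta))

# One-dimensional numerical maximization over the range of the data
opt <- optimize(loglik, interval = range(x), maximum = TRUE)
c(numerical = opt$maximum, median = median(x))
```

Because `optimize` is a general-purpose search, its answer matches the median only up to its numerical tolerance; the median itself is the exact maximizer.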


There is no guarantee that the mle exists or, if it does, that it is unique. This is often
clear from the application as in the next two examples. Other examples are given 
in the exercises. 


Example 6.1.2 (Logistic Distribution). Let $X_1, \ldots, X_n$ be iid with density

$$f(x;\theta) = \frac{\exp\{-(x - \theta)\}}{(1 + \exp\{-(x - \theta)\})^2}, \quad -\infty < x < \infty, \ -\infty < \theta < \infty. \qquad (6.1.8)$$

The log of the likelihood simplifies to

$$l(\theta) = \sum_{i=1}^n \log f(x_i;\theta) = n\theta - n\overline{x} - 2\sum_{i=1}^n \log(1 + \exp\{-(x_i - \theta)\}).$$

Using this, the first partial derivative is

$$l'(\theta) = n - 2\sum_{i=1}^n \frac{\exp\{-(x_i - \theta)\}}{1 + \exp\{-(x_i - \theta)\}}. \qquad (6.1.9)$$

Setting this equation to 0 and rearranging terms results in the equation

$$\sum_{i=1}^n \frac{\exp\{-(x_i - \theta)\}}{1 + \exp\{-(x_i - \theta)\}} = \frac{n}{2}. \qquad (6.1.10)$$

Although this does not simplify, we can show that equation (6.1.10) has a unique
solution. The derivative of the left side of equation (6.1.10) simplifies to

$$\frac{\partial}{\partial\theta}\sum_{i=1}^n \frac{\exp\{-(x_i - \theta)\}}{1 + \exp\{-(x_i - \theta)\}}
 = \sum_{i=1}^n \frac{\exp\{-(x_i - \theta)\}}{(1 + \exp\{-(x_i - \theta)\})^2} > 0.$$

Thus the left side of equation (6.1.10) is a strictly increasing function of $\theta$. Finally,
the left side of (6.1.10) approaches 0 as $\theta \to -\infty$ and approaches n as $\theta \to \infty$.
Thus equation (6.1.10) has a unique solution. Also, the second derivative of $l(\theta)$ is
strictly negative for all $\theta$; hence, the solution is a maximum.

Having shown that the mle exists and is unique, we can use a numerical method 
to obtain the solution. In this case, Newton’s procedure is useful. We discuss this 
in general in the next section, at which time we reconsider this example. ■
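
As a preview of a numerical solution, the R sketch below solves the estimating equation (6.1.10) on simulated logistic data. It uses the base R bracketing root finder `uniroot` rather than Newton's procedure, simply because the monotonicity shown in the example guarantees a sign change between the smallest and largest observations. The data, `theta_true`, and the seed are assumptions made here.

```r
# Solve the likelihood equation (6.1.10) for logistic data with uniroot
set.seed(7)
theta_true <- 1
x <- rlogis(50, location = theta_true)     # simulated data (illustration only)
n <- length(x)

# Left side of (6.1.10) minus n/2; plogis(theta - x) equals
# exp{-(x - theta)} / (1 + exp{-(x - theta)})
ee <- function(theta) sum(plogis(theta - x)) - n / 2

# ee is increasing in theta, negative at min(x) and positive at max(x)
root <- uniroot(ee, interval = range(x), tol = 1e-10)
theta_hat <- root$root

# Cross-check by maximizing the log-likelihood directly
loglik <- function(theta) sum(dlogis(x, location = theta, log = TRUE))
opt <- optimize(loglik, interval = range(x), maximum = TRUE)
c(uniroot = theta_hat, optimize = opt$maximum)
```

The two estimates should agree up to the tolerance of `optimize`, which supports the uniqueness argument given in the example.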


Example 6.1.3. In Example 4.1.2, we discussed the mle of the probability of
success $\theta$ for a random sample $X_1, X_2, \ldots, X_n$ from the Bernoulli distribution with
pmf

$$p(x) = \begin{cases} \theta^x (1 - \theta)^{1-x} & x = 0, 1 \\ 0 & \text{elsewhere,} \end{cases}$$

where $0 \le \theta \le 1$. Recall that the mle is $\overline{X}$, the proportion of sample successes.
Now suppose that we know in advance that, instead of $0 \le \theta \le 1$, $\theta$ is restricted
by the inequalities $0 \le \theta \le 1/3$. If the observations were such that $\overline{x} > 1/3$, then
$\overline{x}$ would not be a satisfactory estimate. Since $\frac{\partial l(\theta)}{\partial\theta} > 0$, provided $\theta < \overline{x}$, under the
restriction $0 \le \theta \le 1/3$, we can maximize $l(\theta)$ by taking $\hat{\theta} = \min\{\overline{x}, \frac{1}{3}\}$. ■
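
The restricted estimate in Example 6.1.3 can be confirmed by maximizing the log-likelihood numerically over the restricted interval. The R sketch below uses a made-up sample whose mean exceeds 1/3; the data and variable names are assumptions for illustration only.

```r
# Restricted mle of a Bernoulli success probability when 0 <= theta <= 1/3
x <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0)     # made-up sample with xbar = 0.5 > 1/3
xbar <- mean(x)

loglik <- function(theta) sum(dbinom(x, size = 1, prob = theta, log = TRUE))

closed_form <- min(xbar, 1/3)
# Numerical maximization restricted to the interval [0, 1/3]
numerical <- optimize(loglik, interval = c(0, 1/3), maximum = TRUE)$maximum
c(closed_form = closed_form, numerical = numerical)
```

Since the log-likelihood is increasing on the whole restricted interval for these data, the numerical search settles just inside the upper endpoint 1/3, in agreement with $\min\{\overline{x}, \frac{1}{3}\}$.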
The following is an appealing property of maximum likelihood estimates. 


Theorem 6.1.2. Let $X_1, \ldots, X_n$ be iid with the pdf $f(x;\theta)$, $\theta \in \Omega$. For a specified
function g, let $\eta = g(\theta)$ be a parameter of interest. Suppose $\hat{\theta}$ is the mle of $\theta$. Then
$g(\hat{\theta})$ is the mle of $\eta = g(\theta)$.

Proof: First suppose g is a one-to-one function. The likelihood of interest is $L(g(\theta))$,
but because g is one-to-one,

$$\max L(g(\theta)) = \max_{\eta = g(\theta)} L(\eta) = \max_{\eta} L(g^{-1}(\eta)).$$

But the maximum occurs when $g^{-1}(\eta) = \hat{\theta}$; i.e., take $\hat{\eta} = g(\hat{\theta})$.

Suppose g is not one-to-one. For each $\eta$ in the range of g, define the set (preimage)

$$g^{-1}(\eta) = \{\theta : g(\theta) = \eta\}.$$

The maximum occurs at $\hat{\theta}$ and the domain of g is $\Omega$, which covers $\hat{\theta}$. Hence, $\hat{\theta}$ is
in one of these preimages and, in fact, it can only be in one preimage. Hence to
maximize $L(\eta)$, choose $\hat{\eta}$ so that $g^{-1}(\hat{\eta})$ is that unique preimage containing $\hat{\theta}$. Then
$\hat{\eta} = g(\hat{\theta})$. ■


Consider Example 4.1.2, where $X_1, \ldots, X_n$ are iid Bernoulli random variables
with probability of success p. As shown in this example, $\hat{p} = \overline{X}$ is the mle of p.
Recall that in the large sample confidence interval for p, (4.2.7), an estimate of
$\sqrt{p(1 - p)}$ is required. By Theorem 6.1.2, the mle of this quantity is $\sqrt{\hat{p}(1 - \hat{p})}$.

We close this section by showing that maximum likelihood estimators, under
regularity conditions, are consistent estimators. Recall that $\mathbf{X}' = (X_1, \ldots, X_n)$.


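Theorem 6.1.3. Assume that $X_1, \ldots, X_n$ satisfy the regularity conditions (R0)
through (R2), where $\theta_0$ is the true parameter, and further that $f(x;\theta)$ is
differentiable with respect to $\theta$ in $\Omega$. Then the likelihood equation,

$$\frac{\partial}{\partial\theta} L(\theta) = 0,$$

or equivalently

$$\frac{\partial}{\partial\theta} l(\theta) = 0,$$

has a solution $\hat{\theta}_n$ such that $\hat{\theta}_n \xrightarrow{P} \theta_0$.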


Proof: Because $\theta_0$ is an interior point in $\Omega$, $(\theta_0 - a, \theta_0 + a) \subset \Omega$, for some a > 0.
Define $S_n$ to be the event

$$S_n = \{\mathbf{X} : l(\theta_0; \mathbf{X}) > l(\theta_0 - a; \mathbf{X})\} \cap \{\mathbf{X} : l(\theta_0; \mathbf{X}) > l(\theta_0 + a; \mathbf{X})\}.$$

By Theorem 6.1.1, $P(S_n) \to 1$. So we can restrict attention to the event $S_n$. But
on $S_n$, $l(\theta)$ has a local maximum, say, $\hat{\theta}_n$, such that $\theta_0 - a < \hat{\theta}_n < \theta_0 + a$ and
$l'(\hat{\theta}_n) = 0$. That is,

$$S_n \subset \{\mathbf{X} : |\hat{\theta}_n(\mathbf{X}) - \theta_0| < a\} \cap \{\mathbf{X} : l'(\hat{\theta}_n(\mathbf{X})) = 0\}.$$

Therefore,

$$1 = \lim_{n\to\infty} P(S_n) \le \underline{\lim}_{n\to\infty} P\left[\{\mathbf{X} : |\hat{\theta}_n(\mathbf{X}) - \theta_0| < a\} \cap \{\mathbf{X} : l'(\hat{\theta}_n(\mathbf{X})) = 0\}\right] \le 1;$$

see Remark 5.2.3 for discussion on $\underline{\lim}$. It follows that for the sequence of solutions
$\hat{\theta}_n$, $P[|\hat{\theta}_n - \theta_0| < a] \to 1$.

The only contentious point in the proof is that the sequence of solutions might
depend on a. But we can always choose a solution "closest" to $\theta_0$ in the following
way. For each n, the set of all solutions in the interval is bounded; hence, the
infimum over solutions closest to $\theta_0$ exists. ■


Note that this theorem is vague in that it discusses solutions of the equation.
If, however, we know that the mle is the unique solution of the equation $l'(\theta) = 0$,
then it is consistent. We state this as a corollary:


Corollary 6.1.1. Assume that $X_1, \ldots, X_n$ satisfy the regularity conditions (R0)
through (R2), where $\theta_0$ is the true parameter, and that $f(x;\theta)$ is differentiable with
respect to $\theta$ in $\Omega$. Suppose the likelihood equation has the unique solution $\hat{\theta}_n$. Then
$\hat{\theta}_n$ is a consistent estimator of $\theta_0$.


EXERCISES 


6.1.1. Let $X_1, X_2, \ldots, X_n$ be a random sample on X that has a $\Gamma(\alpha = 4, \beta = \theta)$
distribution, $0 < \theta < \infty$.


(a) Determine the mle of $\theta$.


(b) Suppose the following data is a realization (rounded) of a random sample on
X. Obtain a histogram with the argument pr=T (data are in ex6111.rda).
9 39 38 23 8 47 21 22 18 10 17 22 14
9 5 26 11 31 15 25 9 29 28 19 8


(c) For this sample, obtain $\hat{\theta}$, the realized value of the mle, and locate $\hat{\theta}$ on the
histogram. Overlay the $\Gamma(\alpha = 4, \beta = \hat{\theta})$ pdf on the histogram. Does the data
agree with this pdf? Code for overlay:

xs=sort(x); y=dgamma(xs,4,1/betahat); hist(x,pr=T); lines(y~xs)


6.1.2. Let $X_1, X_2, \ldots, X_n$ represent a random sample from each of the distributions
having the following pdfs:


(a) $f(x;\theta) = \theta x^{\theta - 1}$, $0 < x < 1$, $0 < \theta < \infty$, zero elsewhere.


(b) $f(x;\theta) = e^{-(x - \theta)}$, $\theta \le x < \infty$, $-\infty < \theta < \infty$, zero elsewhere. Note that this
is a nonregular case.


In each case find the mle $\hat{\theta}$ of $\theta$.


6.1.3. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample from a
distribution with pdf $f(x;\theta) = 1$, $\theta - \frac{1}{2} \le x \le \theta + \frac{1}{2}$, $-\infty < \theta < \infty$, zero elsewhere.
This is a nonregular case. Show that every statistic $u(X_1, X_2, \ldots, X_n)$ such that

$$Y_n - \tfrac{1}{2} \le u(X_1, X_2, \ldots, X_n) \le Y_1 + \tfrac{1}{2}$$

is a mle of $\theta$. In particular, $(4Y_1 + 2Y_n + 1)/6$, $(Y_1 + Y_n)/2$, and $(2Y_1 + 4Y_n - 1)/6$
are three such statistics. Thus, uniqueness is not, in general, a property of mles.


6.1.4. Suppose $X_1, \ldots, X_n$ are iid with pdf $f(x;\theta) = 2x/\theta^2$, $0 < x \le \theta$, zero
elsewhere. Note this is a nonregular case. Find:


(a) The mle $\hat{\theta}$ for $\theta$.
(b) The constant c so that $E(c\hat{\theta}) = \theta$.


(c) The mle for the median of the distribution. Show that it is a consistent
estimator.


6.1.5. Consider the pdf in Exercise 6.1.4. 
(a) Using Theorem 4.8.1, show how to generate observations from this pdf. 


(b) The following data were generated from this pdf. Find the mles of $\theta$ and the
median.
1.2 7.7 4.3 4.1 7.1 6.3 5.3 6.3 5.3 2.8
3.8 7.0 4.5 5.0 6.3 6.7 5.0 7.4 7.5 7.5


6.1.6. Suppose $X_1, X_2, \ldots, X_n$ are iid with pdf $f(x;\theta) = (1/\theta)e^{-x/\theta}$, $0 < x < \infty$,
zero elsewhere. Find the mle of $P(X \le 2)$ and show that it is consistent.


6.1.7. Let the table 


x 0 1 2 3 4 5 
Frequency |6 10 14 13 6 1 


represent a summary of a sample of size 50 from a binomial distribution having 
n =5. Find the mle of P(X > 3). For the data in the table, using the R function 
pbinom determine the realization of the mle. 


6.1.8. Let $X_1, X_2, X_3, X_4, X_5$ be a random sample from a Cauchy distribution with
median $\theta$, that is, with pdf
$$f(x;\theta) = \frac{1}{\pi}\,\frac{1}{1 + (x-\theta)^2}, \quad -\infty < x < \infty,$$
where $-\infty < \theta < \infty$. Suppose $x_1 = -1.94$, $x_2 = 0.59$, $x_3 = -5.98$, $x_4 = -0.08$,
and $x_5 = -0.77$.


(a) Show that the mle can be obtained by minimizing
$$\sum_{i=1}^{5} \log[1 + (x_i - \theta)^2].$$

(b) Approximate the mle by plotting the function in Part (a). Make use of the
following R code, which assumes that the data are in the R vector x:
theta=seq(-6,6,.001); lfs<-c()
for(th in theta){lfs=c(lfs,sum(log((x-th)^2+1)))}
plot(lfs~theta)


6.1.9. Let the table 


x         | 0  1  2  3  4  5
Frequency | 7 14 12 13  6  3


represent a summary of a random sample of size 55 from a Poisson distribution. 
Find the maximum likelihood estimator of P(X = 2). Use the R function dpois to 
find the estimator’s realization for the data in the table. 


6.1.10. Let $X_1, X_2, \ldots, X_n$ be a random sample from a Bernoulli distribution with
parameter $p$. If $p$ is restricted so that we know that $\frac12 \le p \le 1$, find the mle of this
parameter.


6.1.11. Let $X_1, X_2, \ldots, X_n$ be a random sample from a $N(\theta, \sigma^2)$ distribution, where
$\sigma^2$ is fixed but $-\infty < \theta < \infty$.


(a) Show that the mle of $\theta$ is $\bar X$.
(b) If $\theta$ is restricted by $0 \le \theta < \infty$, show that the mle of $\theta$ is $\hat\theta = \max\{0, \bar X\}$.


6.1.12. Let $X_1, X_2, \ldots, X_n$ be a random sample from the Poisson distribution with
$0 < \theta \le 2$. Show that the mle of $\theta$ is $\hat\theta = \min\{\bar X, 2\}$.


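6.1.13. Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution with one of
two pdfs. If $\theta = 1$, then $f(x;\theta = 1) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, $-\infty < x < \infty$. If $\theta = 2$, then
$f(x;\theta = 2) = 1/[\pi(1 + x^2)]$, $-\infty < x < \infty$. Find the mle of $\theta$.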


6.2 Rao—Cramér Lower Bound and Efficiency 


In this section, we establish a remarkable inequality called the Rao—Cramér lower 
bound, which gives a lower bound on the variance of any unbiased estimate. We then 
show that, under regularity conditions, the variances of the maximum likelihood 
estimates achieve this lower bound asymptotically. 

As in the last section, let $X$ be a random variable with pdf $f(x;\theta)$, $\theta \in \Omega$, where
the parameter space $\Omega$ is an open interval. In addition to the regularity conditions
(6.1.1) of Section 6.1, for the following derivations, we require two more regularity
conditions, namely,


Assumptions 6.2.1 (Additional Regularity Conditions). Regularity conditions
(R3) and (R4) are given by


(R3) The pdf $f(x;\theta)$ is twice differentiable as a function of $\theta$.

(R4) The integral $\int f(x;\theta)\,dx$ can be differentiated twice under the integral sign as
a function of $\theta$.


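Note that conditions (R1)-(R4) mean that the parameter $\theta$ does not appear
in the endpoints of the interval in which $f(x;\theta) > 0$ and that we can interchange
integration and differentiation with respect to $\theta$. Our derivation is for the continuous
case, but the discrete case can be handled in a similar manner. We begin with the
identity
$$1 = \int_{-\infty}^{\infty} f(x;\theta)\,dx.$$
Taking the derivative with respect to $\theta$ results in
$$0 = \int_{-\infty}^{\infty} \frac{\partial f(x;\theta)}{\partial\theta}\,dx.$$
The latter expression can be rewritten as
$$0 = \int_{-\infty}^{\infty} \frac{\partial f(x;\theta)/\partial\theta}{f(x;\theta)}\, f(x;\theta)\,dx,$$
or, equivalently,
$$0 = \int_{-\infty}^{\infty} \frac{\partial \log f(x;\theta)}{\partial\theta}\, f(x;\theta)\,dx. \tag{6.2.1}$$
Writing this last equation as an expectation, we have established
$$E\left[\frac{\partial \log f(X;\theta)}{\partial\theta}\right] = 0; \tag{6.2.2}$$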


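that is, the mean of the random variable $\partial \log f(X;\theta)/\partial\theta$ is 0. If we differentiate (6.2.1)
again, it follows that
$$0 = \int_{-\infty}^{\infty} \frac{\partial^2 \log f(x;\theta)}{\partial\theta^2}\, f(x;\theta)\,dx + \int_{-\infty}^{\infty} \frac{\partial \log f(x;\theta)}{\partial\theta}\,\frac{\partial \log f(x;\theta)}{\partial\theta}\, f(x;\theta)\,dx. \tag{6.2.3}$$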


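The second term of the right side of this equation can be written as an expectation,
which we call Fisher information and denote by $I(\theta)$; that is,
$$I(\theta) = \int_{-\infty}^{\infty} \frac{\partial \log f(x;\theta)}{\partial\theta}\,\frac{\partial \log f(x;\theta)}{\partial\theta}\, f(x;\theta)\,dx = E\left[\left(\frac{\partial \log f(X;\theta)}{\partial\theta}\right)^2\right]. \tag{6.2.4}$$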


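From equation (6.2.3), we see that $I(\theta)$ can be computed from
$$I(\theta) = -\int_{-\infty}^{\infty} \frac{\partial^2 \log f(x;\theta)}{\partial\theta^2}\, f(x;\theta)\,dx = -E\left[\frac{\partial^2 \log f(X;\theta)}{\partial\theta^2}\right]. \tag{6.2.5}$$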


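Using equation (6.2.2), Fisher information is the variance of the random variable
$\partial \log f(X;\theta)/\partial\theta$; i.e.,
$$I(\theta) = \operatorname{Var}\left(\frac{\partial \log f(X;\theta)}{\partial\theta}\right). \tag{6.2.6}$$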


Usually, expression (6.2.5) is easier to compute than expression (6.2.4).

Remark 6.2.1. Note that the information is the weighted mean of either
$$\left[\frac{\partial \log f(x;\theta)}{\partial\theta}\right]^2 \quad\text{or}\quad -\frac{\partial^2 \log f(x;\theta)}{\partial\theta^2},$$
where the weights are given by the pdf $f(x;\theta)$. That is, the greater these derivatives
are on the average, the more information that we get about $\theta$. Clearly, if they were
equal to zero [so that $\theta$ would not be in $\log f(x;\theta)$], there would be zero information
about $\theta$. The important function
$$\frac{\partial \log f(x;\theta)}{\partial\theta}$$
is called the score function. Recall that it determines the estimating equations
for the mle; that is, the mle $\hat\theta$ solves
$$\sum_{i=1}^{n} \frac{\partial \log f(x_i;\theta)}{\partial\theta} = 0$$
for $\theta$. ■


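Example 6.2.1 (Information for a Bernoulli Random Variable). Let $X$ be Bernoulli
$b(1, \theta)$. Thus
$$\log f(x;\theta) = x\log\theta + (1-x)\log(1-\theta),$$
$$\frac{\partial \log f(x;\theta)}{\partial\theta} = \frac{x}{\theta} - \frac{1-x}{1-\theta},$$
$$\frac{\partial^2 \log f(x;\theta)}{\partial\theta^2} = -\frac{x}{\theta^2} - \frac{1-x}{(1-\theta)^2}.$$
Clearly,
$$I(\theta) = -E\left[\frac{-X}{\theta^2} - \frac{1-X}{(1-\theta)^2}\right] = \frac{\theta}{\theta^2} + \frac{1-\theta}{(1-\theta)^2} = \frac{1}{\theta} + \frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)},$$
which is larger for $\theta$ values close to zero or one. ■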
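
As a quick numerical check of this example, the following R sketch evaluates the Bernoulli information exactly over the two-point support in three ways: as the mean of the squared score, as minus the mean of the second derivative, and from the closed form. The function name `bernoulli_info` and the value 0.2 are illustrative choices, not from the text.

```r
# Exact Fisher information for a Bernoulli b(1, theta) random variable.
# bernoulli_info is a hypothetical helper written for this illustration.
bernoulli_info <- function(theta) {
  x <- c(0, 1)                                     # support of X
  p <- dbinom(x, size = 1, prob = theta)           # pmf f(x; theta)
  score <- x / theta - (1 - x) / (1 - theta)       # first derivative of log f
  hess <- -x / theta^2 - (1 - x) / (1 - theta)^2   # second derivative of log f
  c(mean_score  = sum(score * p),                  # should be 0, as in (6.2.2)
    from_624    = sum(score^2 * p),                # E[score^2], as in (6.2.4)
    from_625    = -sum(hess * p),                  # -E[second derivative], (6.2.5)
    closed_form = 1 / (theta * (1 - theta)))
}
res <- bernoulli_info(0.2)
res
```

The three information values should agree, and the mean of the score should be zero.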


Example 6.2.2 (Information for a Location Family). Consider a random sample
$X_1, \ldots, X_n$ such that
$$X_i = \theta + e_i, \quad i = 1, \ldots, n, \tag{6.2.7}$$
where $e_1, e_2, \ldots, e_n$ are iid with common pdf $f(x)$ and with support $(-\infty, \infty)$. Then
the common pdf of $X_i$ is $f_X(x;\theta) = f(x - \theta)$. We call model (6.2.7) a location
model. Assume that $f(x)$ satisfies the regularity conditions. Then the information
is
$$I(\theta) = \int_{-\infty}^{\infty} \left(\frac{f'(x-\theta)}{f(x-\theta)}\right)^2 f(x-\theta)\,dx = \int_{-\infty}^{\infty} \left(\frac{f'(z)}{f(z)}\right)^2 f(z)\,dz, \tag{6.2.8}$$
where the last equality follows from the transformation $z = x - \theta$. Hence, in the
location model, the information does not depend on $\theta$.

As an illustration, reconsider Example 6.1.1 concerning the Laplace distribution.
Let $X_1, X_2, \ldots, X_n$ be a random sample from this distribution. Then it follows that
$X_i$ can be expressed as
$$X_i = \theta + e_i, \tag{6.2.9}$$
where $e_1, \ldots, e_n$ are iid with common pdf $f(z) = 2^{-1}\exp\{-|z|\}$, for $-\infty < z < \infty$.
As we did in Example 6.1.1, use $\frac{d}{dz}|z| = \operatorname{sgn}(z)$. Then $f'(z) = -2^{-1}\operatorname{sgn}(z)\exp\{-|z|\}$
and, hence, $[f'(z)/f(z)]^2 = [-\operatorname{sgn}(z)]^2 = 1$, so that
$$I(\theta) = \int_{-\infty}^{\infty} \left(\frac{f'(z)}{f(z)}\right)^2 f(z)\,dz = \int_{-\infty}^{\infty} f(z)\,dz = 1. \tag{6.2.10}$$
Note that the Laplace pdf does not satisfy the regularity conditions, but this
argument can be made rigorous; see Huber (1981) and also Chapter 10. ■
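
The integral in (6.2.8) can also be evaluated numerically. The sketch below does this with R's `integrate` for the Laplace errors of this example and, for comparison, for standard normal errors, where $f'(z)/f(z) = -z$. The helper name `location_info` is hypothetical; the ratio $f'/f$ is passed in analytic form so that the integrand stays finite far out in the tails.

```r
# Numerical evaluation of the location-model information (6.2.8).
# score(z) is f'(z)/f(z) supplied in closed form; f is the error pdf.
location_info <- function(score, f) {
  integrand <- function(z) score(z)^2 * f(z)
  # integrate separately on each side of 0, where the Laplace pdf has a kink
  integrate(integrand, -Inf, 0)$value + integrate(integrand, 0, Inf)$value
}

laplace_pdf <- function(z) 0.5 * exp(-abs(z))

info_laplace <- location_info(function(z) -sign(z), laplace_pdf)  # (6.2.10)
info_normal  <- location_info(function(z) -z, dnorm)              # N(0,1) errors
c(laplace = info_laplace, normal = info_normal)
```

Both values should be close to 1, up to quadrature error.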


From (6.2.6), for a sample of size 1, say $X_1$, Fisher information is the variance
of the random variable $\partial \log f(X_1;\theta)/\partial\theta$. What about a sample of size $n$? Let
$X_1, X_2, \ldots, X_n$ be a random sample from a distribution having pdf $f(x;\theta)$. The
likelihood $L(\theta)$ is the pdf of the random sample, and the random variable whose
variance is the information in the sample is given by
$$\frac{\partial \log L(\theta, \mathbf{X})}{\partial\theta} = \sum_{i=1}^{n} \frac{\partial \log f(X_i;\theta)}{\partial\theta}.$$
The summands are iid with common variance $I(\theta)$. Hence the information in the
sample is
$$\operatorname{Var}\left(\frac{\partial \log L(\theta, \mathbf{X})}{\partial\theta}\right) = nI(\theta). \tag{6.2.11}$$
Thus the information in a random sample of size $n$ is $n$ times the information in a
sample of size 1. So, in Example 6.2.1, the Fisher information in a random sample
of size $n$ from a Bernoulli $b(1, \theta)$ distribution is $n/[\theta(1 - \theta)]$.

We are now ready to obtain the Rao—Cramér lower bound, which we state as a 
theorem. 


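Theorem 6.2.1 (Rao-Cramér Lower Bound). Let $X_1, \ldots, X_n$ be iid with common
pdf $f(x;\theta)$ for $\theta \in \Omega$. Assume that the regularity conditions (R0)-(R4) hold. Let
$Y = u(X_1, X_2, \ldots, X_n)$ be a statistic with mean $E(Y) = E[u(X_1, X_2, \ldots, X_n)] = k(\theta)$.
Then
$$\operatorname{Var}(Y) \ge \frac{[k'(\theta)]^2}{nI(\theta)}. \tag{6.2.12}$$

Proof: The proof is for the continuous case, but the proof for the discrete case is
quite similar. Write the mean of $Y$ as
$$k(\theta) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} u(x_1, \ldots, x_n)\, f(x_1;\theta) \cdots f(x_n;\theta)\,dx_1 \cdots dx_n.$$
Differentiating with respect to $\theta$, we obtain
$$k'(\theta) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} u(x_1, \ldots, x_n) \left[\sum_{i=1}^{n} \frac{1}{f(x_i;\theta)}\frac{\partial f(x_i;\theta)}{\partial\theta}\right] f(x_1;\theta) \cdots f(x_n;\theta)\,dx_1 \cdots dx_n$$
$$= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} u(x_1, \ldots, x_n) \left[\sum_{i=1}^{n} \frac{\partial \log f(x_i;\theta)}{\partial\theta}\right] f(x_1;\theta) \cdots f(x_n;\theta)\,dx_1 \cdots dx_n. \tag{6.2.13}$$
Define the random variable $Z$ by $Z = \sum_{i=1}^{n} [\partial \log f(X_i;\theta)/\partial\theta]$. We know from (6.2.2)
and (6.2.11) that $E(Z) = 0$ and $\operatorname{Var}(Z) = nI(\theta)$, respectively. Also, equation
(6.2.13) can be expressed in terms of expectation as $k'(\theta) = E(YZ)$. Hence we have
$$k'(\theta) = E(YZ) = E(Y)E(Z) + \rho\,\sigma_Y\sqrt{nI(\theta)},$$
where $\rho$ is the correlation coefficient between $Y$ and $Z$. Using $E(Z) = 0$, this
simplifies to
$$\rho = \frac{k'(\theta)}{\sigma_Y\sqrt{nI(\theta)}}.$$
Because $\rho^2 \le 1$, we have
$$\frac{[k'(\theta)]^2}{\sigma_Y^2\, nI(\theta)} \le 1,$$
which, upon rearrangement, is the desired result. ■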


Corollary 6.2.1. Under the assumptions of Theorem 6.2.1, if $Y = u(X_1, \ldots, X_n)$
is an unbiased estimator of $\theta$, so that $k(\theta) = \theta$, then the Rao-Cramér inequality
becomes
$$\operatorname{Var}(Y) \ge \frac{1}{nI(\theta)}.$$


Consider the Bernoulli model with probability of success $\theta$ which was treated in
Example 6.2.1. In the example we showed that $1/[nI(\theta)] = \theta(1-\theta)/n$. From Example
4.1.2 of Section 4.1, the mle of $\theta$ is $\bar X$. The mean and variance of a Bernoulli($\theta$)
distribution are $\theta$ and $\theta(1 - \theta)$, respectively. Hence the mean and variance of $\bar X$
are $\theta$ and $\theta(1 - \theta)/n$, respectively. That is, in this case the variance of the mle has
attained the Rao-Cramér lower bound.
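
A small simulation can illustrate that the variance of $\bar X$ sits at the bound $\theta(1-\theta)/n$. The following R sketch is only a Monte Carlo illustration; the seed, $\theta = 0.3$, $n = 25$, and the number of replications are arbitrary choices.

```r
# Monte Carlo check that Var(Xbar) matches the Rao-Cramer bound theta(1-theta)/n
# for Bernoulli sampling. All settings below are arbitrary illustrative values.
set.seed(20)
theta <- 0.3
n <- 25
nsims <- 20000
xbar <- rbinom(nsims, size = n, prob = theta) / n  # each draw is one sample mean
sim_var <- var(xbar)
rc_bound <- theta * (1 - theta) / n
c(simulated = sim_var, bound = rc_bound)
```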

We now make the following definitions. 


Definition 6.2.1 (Efficient Estimator). Let $Y$ be an unbiased estimator of a
parameter $\theta$ in the case of point estimation. The statistic $Y$ is called an efficient
estimator of $\theta$ if and only if the variance of $Y$ attains the Rao-Cramér lower
bound.

Definition 6.2.2 (Efficiency). In cases in which we can differentiate with respect
to a parameter under an integral or summation symbol, the ratio of the Rao-Cramér
lower bound to the actual variance of any unbiased estimator of a parameter is called
the efficiency of that estimator.


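Example 6.2.3 (Poisson($\theta$) Distribution). Let $X_1, X_2, \ldots, X_n$ denote a random
sample from a Poisson distribution that has the mean $\theta > 0$. It is known that $\bar X$ is
an mle of $\theta$; we shall show that it is also an efficient estimator of $\theta$. We have
$$\frac{\partial \log f(x;\theta)}{\partial\theta} = \frac{\partial}{\partial\theta}(x\log\theta - \theta - \log x!) = \frac{x}{\theta} - 1 = \frac{x - \theta}{\theta}.$$
Accordingly,
$$E\left[\left(\frac{\partial \log f(X;\theta)}{\partial\theta}\right)^2\right] = \frac{E(X - \theta)^2}{\theta^2} = \frac{\sigma^2}{\theta^2} = \frac{\theta}{\theta^2} = \frac{1}{\theta}.$$
The Rao-Cramér lower bound in this case is $1/[n(1/\theta)] = \theta/n$. But $\theta/n$ is the
variance of $\bar X$. Hence $\bar X$ is an efficient estimator of $\theta$. ■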


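Example 6.2.4 (Beta($\theta$, 1) Distribution). Let $X_1, X_2, \ldots, X_n$ denote a random
sample of size $n > 2$ from a distribution with pdf
$$f(x;\theta) = \begin{cases} \theta x^{\theta-1} & \text{for } 0 < x < 1 \\ 0 & \text{elsewhere,} \end{cases} \tag{6.2.14}$$
where the parameter space is $\Omega = (0, \infty)$. This is the beta distribution, (3.3.9),
with parameters $\theta$ and 1, which we denote by beta($\theta$, 1). The derivative of the log
of $f$ is
$$\frac{\partial \log f}{\partial\theta} = \log x + \frac{1}{\theta}. \tag{6.2.15}$$
From this we have $\partial^2 \log f/\partial\theta^2 = -\theta^{-2}$. Hence the information is $I(\theta) = \theta^{-2}$.
Next, we find the mle of $\theta$ and investigate its efficiency. The log of the likelihood
function is
$$l(\theta) = \theta\sum_{i=1}^{n} \log x_i - \sum_{i=1}^{n} \log x_i + n\log\theta.$$
The first partial of $l(\theta)$ is
$$\frac{\partial l(\theta)}{\partial\theta} = \sum_{i=1}^{n} \log x_i + \frac{n}{\theta}. \tag{6.2.16}$$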


Setting this to 0 and solving for $\theta$, the mle is $\hat\theta = -n/\sum_{i=1}^{n} \log X_i$. To obtain
the distribution of $\hat\theta$, let $Y_i = -\log X_i$. A straight transformation argument shows
that the distribution is $\Gamma(1, 1/\theta)$. Because the $X_i$s are independent, Theorem 3.3.1
shows that $W = \sum_{i=1}^{n} Y_i$ is $\Gamma(n, 1/\theta)$. Theorem 3.3.2 shows that
$$E[W^k] = \frac{(n + k - 1)!}{\theta^k (n - 1)!}, \tag{6.2.17}$$
for $k > -n$. So, in particular for $k = -1$, we get
$$E[\hat\theta] = nE[W^{-1}] = \theta\,\frac{n}{n-1}.$$
Hence, $\hat\theta$ is biased, but the bias vanishes as $n \to \infty$. Also, note that the estimator
$[(n - 1)/n]\hat\theta$ is unbiased. For $k = -2$, we get
$$E[\hat\theta^2] = n^2 E[W^{-2}] = \theta^2\,\frac{n^2}{(n-1)(n-2)},$$
and, hence, after simplifying $E(\hat\theta^2) - [E(\hat\theta)]^2$, we obtain
$$\operatorname{Var}(\hat\theta) = \theta^2\,\frac{n^2}{(n-1)^2(n-2)}.$$
From this, we can obtain the variance of the unbiased estimator $[(n - 1)/n]\hat\theta$, i.e.,
$$\operatorname{Var}\left(\frac{n-1}{n}\,\hat\theta\right) = \frac{\theta^2}{n-2}.$$
From above, the information is $I(\theta) = \theta^{-2}$ and, hence, the variance of an unbiased
efficient estimator is $\theta^2/n$. Because $\frac{\theta^2}{n-2} > \frac{\theta^2}{n}$, the unbiased estimator $[(n - 1)/n]\hat\theta$
is not efficient. Notice, though, that its efficiency (as in Definition 6.2.2) converges
to 1 as $n \to \infty$. Later in this section, we say that $[(n - 1)/n]\hat\theta$ is asymptotically
efficient. ■
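
The moment formulas of this example can be checked by simulation. The R sketch below draws repeated beta($\theta$, 1) samples with `rbeta`, computes the mle and its unbiased version, and compares their simulated mean and variance with $n\theta/(n-1)$ and $\theta^2/(n-2)$. The seed, $\theta = 2$, and $n = 10$ are arbitrary illustrative settings.

```r
# Simulation check of Example 6.2.4: bias of the mle and variance of the
# unbiased estimator [(n-1)/n] * mle. Settings are illustrative only.
set.seed(624)
theta <- 2
n <- 10
nsims <- 50000
x <- matrix(rbeta(nsims * n, shape1 = theta, shape2 = 1), nrow = nsims)
mle <- -n / rowSums(log(x))          # one mle per simulated sample (row)
unb <- (n - 1) / n * mle             # unbiased version

mean_mle_theory <- n * theta / (n - 1)
var_unb_theory  <- theta^2 / (n - 2)
efficiency      <- (theta^2 / n) / var_unb_theory   # Definition 6.2.2: (n-2)/n

rbind(simulated = c(mean_mle = mean(mle), mean_unb = mean(unb), var_unb = var(unb)),
      theory    = c(mean_mle_theory, theta, var_unb_theory))
```

With $n = 10$ the efficiency of the unbiased estimator is $(n-2)/n = 0.8$, which approaches 1 as $n$ grows.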


In the above examples, we were able to obtain the mles in closed form along 
with their distributions and, hence, moments. This is often not the case. Maximum 
likelihood estimators, however, have an asymptotic normal distribution. In fact, 
mles are asymptotically efficient. To prove these assertions, we need the additional 
regularity condition given by 


Assumptions 6.2.2 (Additional Regularity Condition). Regularity condition (R5)
is


(R5) The pdf $f(x;\theta)$ is three times differentiable as a function of $\theta$. Further, for
all $\theta \in \Omega$, there exist a constant $c$ and a function $M(x)$ such that
$$\left|\frac{\partial^3}{\partial\theta^3} \log f(x;\theta)\right| \le M(x),$$
with $E_{\theta_0}[M(X)] < \infty$, for all $\theta_0 - c < \theta < \theta_0 + c$ and all $x$ in the support of
$X$.

Theorem 6.2.2. Assume $X_1, \ldots, X_n$ are iid with pdf $f(x;\theta_0)$ for $\theta_0 \in \Omega$ such that
the regularity conditions (R0)-(R5) are satisfied. Suppose further that the Fisher
information satisfies $0 < I(\theta_0) < \infty$. Then any consistent sequence of solutions of
the mle equations satisfies
$$\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{D} N\left(0, \frac{1}{I(\theta_0)}\right). \tag{6.2.18}$$


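Proof: Expanding the function $l'(\theta)$ into a Taylor series of order 2 about $\theta_0$ and
evaluating it at $\hat\theta_n$, we get
$$l'(\hat\theta_n) = l'(\theta_0) + (\hat\theta_n - \theta_0)\,l''(\theta_0) + \tfrac12(\hat\theta_n - \theta_0)^2\, l'''(\theta_n^*), \tag{6.2.19}$$
where $\theta_n^*$ is between $\theta_0$ and $\hat\theta_n$. But $l'(\hat\theta_n) = 0$. Hence, rearranging terms, we obtain
$$\sqrt{n}(\hat\theta_n - \theta_0) = \frac{n^{-1/2}\, l'(\theta_0)}{-n^{-1} l''(\theta_0) - (2n)^{-1}(\hat\theta_n - \theta_0)\, l'''(\theta_n^*)}. \tag{6.2.20}$$
By the Central Limit Theorem,
$$\frac{1}{\sqrt{n}}\, l'(\theta_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{\partial \log f(X_i;\theta_0)}{\partial\theta} \xrightarrow{D} N(0, I(\theta_0)), \tag{6.2.21}$$
because the summands are iid with $\operatorname{Var}(\partial \log f(X_i;\theta_0)/\partial\theta) = I(\theta_0) < \infty$. Also, by
the Law of Large Numbers,
$$-\frac{1}{n}\, l''(\theta_0) = -\frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2 \log f(X_i;\theta_0)}{\partial\theta^2} \xrightarrow{P} I(\theta_0). \tag{6.2.22}$$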


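To complete the proof then, we need only show that the second term in the
denominator of expression (6.2.20) goes to zero in probability. Because $\hat\theta_n - \theta_0 \xrightarrow{P} 0$,
by Theorem 5.2.7, this follows provided that $n^{-1} l'''(\theta_n^*)$ is bounded in probability.
Let $c_0$ be the constant defined in condition (R5). Note that $|\hat\theta_n - \theta_0| < c_0$ implies
that $|\theta_n^* - \theta_0| < c_0$, which in turn by condition (R5) implies the following string of
inequalities:
$$\left|-\frac{1}{n}\, l'''(\theta_n^*)\right| \le \frac{1}{n} \sum_{i=1}^{n} \left|\frac{\partial^3 \log f(X_i;\theta_n^*)}{\partial\theta^3}\right| \le \frac{1}{n} \sum_{i=1}^{n} M(X_i). \tag{6.2.23}$$
By condition (R5), $E_{\theta_0}[M(X)] < \infty$; hence, $\frac{1}{n}\sum_{i=1}^{n} M(X_i) \xrightarrow{P} E_{\theta_0}[M(X)]$, by the
Law of Large Numbers. For the bound, we select $1 + E_{\theta_0}[M(X)]$. Let $\epsilon > 0$ be
given. Choose $N_1$ and $N_2$ so that
$$n \ge N_1 \Rightarrow P[|\hat\theta_n - \theta_0| < c_0] \ge 1 - \frac{\epsilon}{2}, \tag{6.2.24}$$
$$n \ge N_2 \Rightarrow P\left[\left|\frac{1}{n}\sum_{i=1}^{n} M(X_i) - E_{\theta_0}[M(X)]\right| < 1\right] \ge 1 - \frac{\epsilon}{2}. \tag{6.2.25}$$
It follows from (6.2.23)-(6.2.25) that
$$n \ge \max\{N_1, N_2\} \Rightarrow P\left[\left|-\frac{1}{n}\, l'''(\theta_n^*)\right| \le 1 + E_{\theta_0}[M(X)]\right] \ge 1 - \epsilon;$$
hence, $n^{-1} l'''(\theta_n^*)$ is bounded in probability. ■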
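
To see the limit (6.2.18) at work, one can simulate the standardized mle $\sqrt{nI(\theta_0)}\,(\hat\theta - \theta_0)$ and compare it with a standard normal. The R sketch below does this for the beta($\theta$, 1) model of Example 6.2.4, where $I(\theta) = \theta^{-2}$; the seed, $\theta_0 = 2$, $n = 400$, and the number of replications are assumptions made for the illustration.

```r
# Simulation illustrating Theorem 6.2.2 for the beta(theta, 1) model:
# z = sqrt(n * I(theta0)) * (mle - theta0) should be roughly N(0, 1).
set.seed(6218)
theta0 <- 2
n <- 400
nsims <- 5000
x <- matrix(rbeta(nsims * n, shape1 = theta0, shape2 = 1), nrow = nsims)
mle <- -n / rowSums(log(x))
z <- sqrt(n / theta0^2) * (mle - theta0)    # I(theta0) = theta0^(-2)
summary_z <- c(mean = mean(z), var = var(z),
               within_1.96 = mean(abs(z) <= qnorm(0.975)))
summary_z
```

The mean should be near 0 (a small positive value reflects the finite-sample bias of the mle), the variance near 1, and about 95% of the standardized values should fall within $\pm 1.96$.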


We next generalize Definitions 6.2.1 and 6.2.2 concerning efficiency to the asymp- 
totic case. 


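Definition 6.2.3. Let $X_1, \ldots, X_n$ be independent and identically distributed with
probability density function $f(x;\theta)$. Suppose $\hat\theta_{1n} = \hat\theta_{1n}(X_1, \ldots, X_n)$ is an estimator
of $\theta_0$ such that $\sqrt{n}(\hat\theta_{1n} - \theta_0) \xrightarrow{D} N\left(0, \sigma^2_{\hat\theta_{1n}}\right)$. Then


(a) The asymptotic efficiency of $\hat\theta_{1n}$ is defined to be
$$e(\hat\theta_{1n}) = \frac{1/I(\theta_0)}{\sigma^2_{\hat\theta_{1n}}}. \tag{6.2.26}$$

(b) The estimator $\hat\theta_{1n}$ is said to be asymptotically efficient if the ratio in part
(a) is 1.


(c) Let $\hat\theta_{2n}$ be another estimator such that $\sqrt{n}(\hat\theta_{2n} - \theta_0) \xrightarrow{D} N\left(0, \sigma^2_{\hat\theta_{2n}}\right)$. Then the
asymptotic relative efficiency (ARE) of $\hat\theta_{1n}$ to $\hat\theta_{2n}$ is the reciprocal of the
ratio of their respective asymptotic variances; i.e.,
$$e(\hat\theta_{1n}, \hat\theta_{2n}) = \frac{\sigma^2_{\hat\theta_{2n}}}{\sigma^2_{\hat\theta_{1n}}}. \tag{6.2.27}$$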


Hence, by Theorem 6.2.2, under regularity conditions, maximum likelihood es- 
timators are asymptotically efficient estimators. This is a nice optimality result. 
Also, if two estimators are asymptotically normal with the same asymptotic mean, 
then intuitively the estimator with the smaller asymptotic variance would be se- 
lected over the other as a better estimator. In this case, the ARE of the selected 
estimator to the nonselected one is greater than 1. 


Example 6.2.5 (ARE of the Sample Median to the Sample Mean). We obtain
this ARE under the Laplace and normal distributions. Consider first the Laplace
location model as given in expression (6.2.9); i.e.,
$$X_i = \theta + e_i, \quad i = 1, \ldots, n. \tag{6.2.28}$$
By Example 6.1.1, we know that the mle of $\theta$ is the sample median, $Q_2$. By (6.2.10),
the information $I(\theta_0) = 1$ for this distribution; hence, $Q_2$ is asymptotically normal
with mean $\theta$ and variance $1/n$. On the other hand, by the Central Limit Theorem,
the sample mean $\bar X$ is asymptotically normal with mean $\theta$ and variance $\sigma^2/n$, where
$\sigma^2 = \operatorname{Var}(X_i) = \operatorname{Var}(e_i + \theta) = \operatorname{Var}(e_i) = E(e_i^2)$. But
$$E(e_i^2) = \int_{-\infty}^{\infty} z^2\, 2^{-1}\exp\{-|z|\}\,dz = \int_{0}^{\infty} z^{3-1}\exp\{-z\}\,dz = \Gamma(3) = 2.$$
Therefore, ARE$(Q_2, \bar X) = \frac{2}{1} = 2$. Thus, if the sample comes from a Laplace
distribution, then asymptotically the sample median is twice as efficient as the
sample mean.

Next suppose the location model (6.2.28) holds, except now the pdf of $e_i$ is
$N(0, 1)$. Under this model, by Theorem 10.2.3, $Q_2$ is asymptotically normal with
mean $\theta$ and variance $(\pi/2)/n$. Because the variance of $\bar X$ is $1/n$, in this case, the
ARE$(Q_2, \bar X) = \frac{1}{\pi/2} = 2/\pi = 0.636$. Since $\pi/2 = 1.57$, asymptotically, $\bar X$ is 1.57
times more efficient than $Q_2$ if the sample arises from the normal distribution. ■
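
The two AREs of this example are asymptotic values, but a simulation at a moderate sample size already shows the same ordering. The R sketch below estimates $\operatorname{Var}(\bar X)/\operatorname{Var}(Q_2)$ under Laplace and under standard normal errors with $\theta = 0$. Laplace errors are generated as an exponential variate with a random sign; the seed, $n = 101$, the number of replications, and the helper name `are_median_to_mean` are illustrative assumptions. At finite $n$ the estimates will differ somewhat from the limits 2 and $2/\pi$.

```r
# Monte Carlo estimate of ARE(Q2, Xbar) = Var(Xbar) / Var(Q2) under two error
# distributions. Finite-sample values only approximate the asymptotic AREs.
set.seed(625)
n <- 101
nsims <- 5000

are_median_to_mean <- function(errors) {
  # rows are simulated samples; ratio of the variance of the sample means
  # to the variance of the sample medians
  var(rowMeans(errors)) / var(apply(errors, 1, median))
}

# Laplace errors: exponential magnitude times a random sign
laplace <- matrix(rexp(n * nsims) * sample(c(-1, 1), n * nsims, replace = TRUE),
                  nrow = nsims)
normal <- matrix(rnorm(n * nsims), nrow = nsims)

are_laplace <- are_median_to_mean(laplace)  # asymptotic value is 2
are_normal  <- are_median_to_mean(normal)   # asymptotic value is 2/pi
c(laplace = are_laplace, normal = are_normal)
```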


Theorem 6.2.2 is also a practical result in that it gives us a way of doing inference.
The asymptotic standard deviation of the mle $\hat\theta$ is $[nI(\theta_0)]^{-1/2}$. Because $I(\theta)$ is a
continuous function of $\theta$, it follows from Theorems 5.1.4 and 6.1.2 that
$$I(\hat\theta_n) \xrightarrow{P} I(\theta_0).$$
Thus we have a consistent estimate of the asymptotic standard deviation of the mle.
Based on this result and the discussion of confidence intervals in Chapter 4, for a
specified $0 < \alpha < 1$, the following interval is an approximate $(1-\alpha)100\%$ confidence
interval for $\theta$,
$$\left(\hat\theta_n - z_{\alpha/2}\,\frac{1}{\sqrt{nI(\hat\theta_n)}},\ \hat\theta_n + z_{\alpha/2}\,\frac{1}{\sqrt{nI(\hat\theta_n)}}\right). \tag{6.2.29}$$

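
Interval (6.2.29) is straightforward to code once the information function is available. The following R sketch is one possible implementation; the function name `mle_wald_ci` and its arguments are hypothetical. It is applied to the Poisson model of Example 6.2.3, where the mle is $\bar X$ and $I(\theta) = 1/\theta$, using made-up summary values $\bar x = 4$ and $n = 100$.

```r
# Approximate (1 - alpha)100% confidence interval (6.2.29) based on the mle.
# theta_hat: realized mle; info: function returning I(theta); n: sample size.
mle_wald_ci <- function(theta_hat, info, n, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)                 # upper alpha/2 normal quantile
  se <- 1 / sqrt(n * info(theta_hat))       # estimated asymptotic std. deviation
  c(lower = theta_hat - z * se, upper = theta_hat + z * se)
}

# Poisson illustration with hypothetical values xbar = 4, n = 100
ci <- mle_wald_ci(theta_hat = 4, info = function(th) 1 / th, n = 100)
ci
```

Here the estimated standard error is $\sqrt{4/100} = 0.2$, so the interval is $4 \pm 1.96(0.2)$.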
Remark 6.2.2. If we use the asymptotic distributions to construct confidence
intervals for $\theta$, the fact that the ARE$(Q_2, \bar X) = 2$ when the underlying distribution
is the Laplace means that $n$ would need to be twice as large for $\bar X$ to get the same
length confidence interval as we would if we used $Q_2$. ■


A simple corollary to Theorem 6.2.2 yields the asymptotic distribution of a
function $g(\hat\theta_n)$ of the mle.


Corollary 6.2.2. Under the assumptions of Theorem 6.2.2, suppose $g(x)$ is a
continuous function of $x$ that is differentiable at $\theta_0$ such that $g'(\theta_0) \ne 0$. Then
$$\sqrt{n}\,(g(\hat\theta_n) - g(\theta_0)) \xrightarrow{D} N\left(0, \frac{g'(\theta_0)^2}{I(\theta_0)}\right). \tag{6.2.30}$$

The proof of this corollary follows immediately from the $\Delta$-method, Theorem
5.2.9, and Theorem 6.2.2.

The proof of Theorem 6.2.2 contains an asymptotic representation of $\hat\theta$ which
proves useful; hence, we state it as another corollary.

Corollary 6.2.3. Under the assumptions of Theorem 6.2.2,
$$\sqrt{n}(\hat\theta_n - \theta_0) = \frac{1}{I(\theta_0)}\,\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{\partial \log f(X_i;\theta_0)}{\partial\theta} + R_n, \tag{6.2.31}$$
where $R_n \xrightarrow{P} 0$.


The proof is just a rearrangement of equation (6.2.20) and the ensuing results in
the proof of Theorem 6.2.2.


Example 6.2.6 (Example 6.2.4, Continued). Let $X_1, \ldots, X_n$ be a random sample
having the common pdf (6.2.14). Recall that $I(\theta) = \theta^{-2}$ and that the mle is
$\hat\theta = -n/\sum_{i=1}^{n} \log X_i$. Hence, $\hat\theta$ is approximately normally distributed with mean $\theta$
and variance $\theta^2/n$. Based on this, an approximate $(1 - \alpha)100\%$ confidence interval
for $\theta$ is
$$\hat\theta \pm z_{\alpha/2}\,\frac{\hat\theta}{\sqrt{n}}.$$
Recall that we were able to obtain the exact distribution of $\hat\theta$ in this case. As
Exercise 6.2.12 shows, based on this distribution of $\hat\theta$, an exact confidence interval
for $\theta$ can be constructed. ■


In obtaining the mle of $\theta$, we are often in the situation of Example 6.1.2; that
is, we can verify the existence of the mle, but the solution of the equation $l'(\theta) = 0$
cannot be obtained in closed form. In such situations, numerical methods are
used. One iterative method that exhibits rapid (quadratic) convergence is Newton's
method. The sketch in Figure 6.2.1 helps recall this method. Suppose $\hat\theta^{(0)}$ is an
initial guess at the solution. The next guess (one-step estimate) is the point $\hat\theta^{(1)}$,
which is the horizontal intercept of the tangent line to the curve $l'(\theta)$ at the point
$(\hat\theta^{(0)}, l'(\hat\theta^{(0)}))$. A little algebra finds
$$\hat\theta^{(1)} = \hat\theta^{(0)} - \frac{l'(\hat\theta^{(0)})}{l''(\hat\theta^{(0)})}. \tag{6.2.32}$$
We then substitute $\hat\theta^{(1)}$ for $\hat\theta^{(0)}$ and repeat the process. On the figure, trace the
second step estimate $\hat\theta^{(2)}$; the process is continued until convergence.
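
The update (6.2.32) can be written as a short generic routine. The R sketch below is a minimal version under stated assumptions: the function `newton_mle`, its tolerance, and its iteration cap are illustrative choices, and the made-up data vector is used only to exercise the code. It is tried on the beta($\theta$, 1) model of Example 6.2.4, where $l'(\theta) = \sum \log x_i + n/\theta$ and $l''(\theta) = -n/\theta^2$, so that the Newton iterate can be compared with the closed-form mle.

```r
# Generic Newton iteration (6.2.32) for solving l'(theta) = 0.
# dl and d2l are functions returning l'(theta) and l''(theta).
newton_mle <- function(dl, d2l, theta0, tol = 1e-8, max_iter = 50) {
  theta <- theta0
  for (k in seq_len(max_iter)) {
    step <- dl(theta) / d2l(theta)
    theta <- theta - step            # one Newton step
    if (abs(step) < tol) break       # stop when the update is negligible
  }
  theta
}

# Check on the beta(theta, 1) model with a small hypothetical data vector
x <- c(0.9, 0.5, 0.7, 0.8)
n <- length(x)
s <- sum(log(x))
theta_newton <- newton_mle(dl  = function(th) s + n / th,
                           d2l = function(th) -n / th^2,
                           theta0 = 1)
theta_closed <- -n / s               # closed-form mle from Example 6.2.4
c(newton = theta_newton, closed_form = theta_closed)
```

As with any Newton scheme, convergence depends on the starting value; for this model the iteration converges when the initial guess lies between 0 and twice the mle.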


Example 6.2.7 (Example 6.1.2, continued). Recall Example 6.1.2, where the
random sample $X_1, \ldots, X_n$ has the common logistic density
$$f(x;\theta) = \frac{\exp\{-(x - \theta)\}}{(1 + \exp\{-(x - \theta)\})^2}, \quad -\infty < x < \infty,\ -\infty < \theta < \infty. \tag{6.2.33}$$
We showed that the likelihood equation has a unique solution, though it cannot
be obtained in closed form. To use formula (6.2.32), we need the first and second
partial derivatives of $l(\theta)$ and an initial guess. Expression (6.1.9) of Example 6.1.2
gives the first partial derivative, from which the second partial is
$$l''(\theta) = -2\sum_{i=1}^{n} \frac{\exp\{-(x_i - \theta)\}}{(1 + \exp\{-(x_i - \theta)\})^2}.$$

Figure 6.2.1: Beginning with the starting value $\hat\theta^{(0)}$, the one-step estimate is
$\hat\theta^{(1)}$, which is the intersection of the tangent line to the curve $l'(\theta)$ at $\hat\theta^{(0)}$ and the
horizontal axis. In the figure, $dl(\theta) = l'(\theta)$.


The logistic distribution is similar to the normal distribution; hence, we can use
$\bar X$ as our initial guess of $\theta$. The R function mlelogistic, at the site listed in the
preface, computes the k-step estimates. ■
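
For readers without access to that function, the iteration can be sketched directly. The R code below is not the text's mlelogistic; it is an independent sketch (the name `logistic_mle`, the tolerance, and the iteration cap are assumptions) that applies (6.2.32) with the first derivative $l'(\theta) = n - 2\sum \exp\{-(x_i-\theta)\}/(1+\exp\{-(x_i-\theta)\})$ and the second derivative given above, starting from $\bar x$.

```r
# Newton iteration for the mle of the logistic location parameter.
# An independent sketch; not the mlelogistic function cited in the text.
logistic_mle <- function(x, tol = 1e-8, max_iter = 100) {
  n <- length(x)
  theta <- mean(x)                      # initial guess: the sample mean
  for (k in seq_len(max_iter)) {
    p <- plogis(-(x - theta))           # exp{-(x-theta)} / (1 + exp{-(x-theta)})
    dl <- n - 2 * sum(p)                # l'(theta)
    d2l <- -2 * sum(p * (1 - p))        # l''(theta); p(1-p) = exp{.}/(1+exp{.})^2
    step <- dl / d2l
    theta <- theta - step
    if (abs(step) < tol) break
  }
  theta
}

set.seed(627)
x <- rlogis(200, location = 1)          # simulated data with theta = 1
theta_hat <- logistic_mle(x)
score_at_mle <- length(x) - 2 * sum(plogis(-(x - theta_hat)))
c(mle = theta_hat, score = score_at_mle)
```

The score evaluated at the returned value should be numerically zero, which confirms that the likelihood equation has been solved.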


We close this section with a remarkable fact. The estimate $\hat\theta^{(1)}$ in equation
(6.2.32) is called the one-step estimator. As Exercise 6.2.15 shows, this estimator
has the same asymptotic distribution as the mle [i.e., (6.2.18)], provided that the
initial guess $\hat\theta^{(0)}$ is a consistent estimator of $\theta$. That is, the one-step estimate is an
asymptotically efficient estimate of $\theta$. This is also true of the other iterative steps.


EXERCISES 


6.2.1. Prove that $\bar X$, the mean of a random sample of size $n$ from a distribution
that is $N(\theta, \sigma^2)$, $-\infty < \theta < \infty$, is, for every known $\sigma^2 > 0$, an efficient estimator
of $\theta$.


6.2.2. Given $f(x;\theta) = 1/\theta$, $0 < x < \theta$, zero elsewhere, with $\theta > 0$, formally
compute the reciprocal of
$$nE\left\{\left[\frac{\partial \log f(X;\theta)}{\partial\theta}\right]^2\right\}.$$
Compare this with the variance of $(n + 1)Y_n/n$, where $Y_n$ is the largest observation
of a random sample of size $n$ from this distribution. Comment.


6.2.3. Given the pdf
$$f(x;\theta) = \frac{1}{\pi[1 + (x - \theta)^2]}, \quad -\infty < x < \infty,\ -\infty < \theta < \infty,$$
show that the Rao-Cramér lower bound is $2/n$, where $n$ is the size of a random
sample from this Cauchy distribution. What is the asymptotic distribution of $\sqrt{n}(\hat\theta - \theta)$
if $\hat\theta$ is the mle of $\theta$?

6.2.4. Consider Example 6.2.2, where we discussed the location model. 


(a) Write the location model when e; has the logistic pdf given in expression 
(4.4.11). 


(b) Using expression (6.2.8), show that the information $I(\theta) = 1/3$ for the model
in part (a). Hint: In the integral of expression (6.2.8), use the substitution
$u = (1 + e^{-z})^{-1}$. Then $du = f(z)\,dz$, where $f(z)$ is the pdf (4.4.11).


6.2.5. Using the same location model as in part (a) of Exercise 6.2.4, obtain the 
ARE of the sample median to mle of the model. 

Hint: The mle of $\theta$ for this model is discussed in Example 6.2.7. Furthermore, as
shown in Theorem 10.2.3 of Chapter 10, $Q_2$ is asymptotically normal with
asymptotic mean $\theta$ and asymptotic variance $1/(4f^2(0)n)$.


6.2.6. Consider a location model (Example 6.2.2) when the error pdf is the
contaminated normal (3.4.17) with $\epsilon$ as the proportion of contamination and with $\sigma_c^2$
as the variance of the contaminated part. Show that the ARE of the sample median
to the sample mean is given by
$$e(Q_2, \bar X) = \frac{2[1 + \epsilon(\sigma_c^2 - 1)][1 - \epsilon + (\epsilon/\sigma_c)]^2}{\pi}. \tag{6.2.34}$$
Use the hint in Exercise 6.2.5 for the median.


(a) If $\sigma_c^2 = 9$, use (6.2.34) to fill in the following table:

$\epsilon$          | 0 | 0.05 | 0.10 | 0.15 |
$e(Q_2, \bar X)$ |   |      |      |      |


(b) Notice from the table that the sample median becomes the "better" estimator
when $\epsilon$ increases from 0.10 to 0.15. Determine the value for $\epsilon$ where this occurs
[this involves a third-degree polynomial in $\epsilon$, so one way of obtaining the root
is to use the Newton algorithm discussed around expression (6.2.32)].


6.2.7. Recall Exercise 6.1.1 where $X_1, X_2, \ldots, X_n$ is a random sample on $X$ that
has a $\Gamma(\alpha = 4, \beta = \theta)$ distribution, $0 < \theta < \infty$.

(a) Find the Fisher information $I(\theta)$.


(b) Show that the mle of $\theta$, which was derived in Exercise 6.1.1, is an efficient
estimator of $\theta$.


(c) Using Theorem 6.2.2, obtain the asymptotic distribution of $\sqrt{n}(\hat\theta - \theta)$.


(d) For the data of Exercise 6.1.1, find the asymptotic 95% confidence interval
for $\theta$.


6.2.8. Let $X$ be $N(0, \theta)$, $0 < \theta < \infty$.

(a) Find the Fisher information $I(\theta)$.


(b) If $X_1, X_2, \ldots, X_n$ is a random sample from this distribution, show that the
mle of $\theta$ is an efficient estimator of $\theta$.


(c) What is the asymptotic distribution of $\sqrt{n}(\hat\theta - \theta)$?


6.2.9. If $X_1, X_2, \ldots, X_n$ is a random sample from a distribution with pdf
$$f(x;\theta) = \begin{cases} \dfrac{3\theta^3}{(x + \theta)^4} & 0 < x < \infty,\ 0 < \theta < \infty \\ 0 & \text{elsewhere,} \end{cases}$$
show that $Y = 2\bar X$ is an unbiased estimator of $\theta$ and determine its efficiency.


6.2.10. Let $X_1, X_2, \ldots, X_n$ be a random sample from a $N(0, \theta)$ distribution. We
want to estimate the standard deviation $\sqrt{\theta}$. Find the constant $c$ so that
$Y = c\sum_{i=1}^{n} |X_i|$ is an unbiased estimator of $\sqrt{\theta}$ and determine its efficiency.


6.2.11. Let $\bar X$ be the mean of a random sample of size $n$ from a $N(\theta, \sigma^2)$ distribution,
$-\infty < \theta < \infty$, $\sigma^2 > 0$. Assume that $\sigma^2$ is known. Show that $\bar X^2 - \frac{\sigma^2}{n}$ is an
unbiased estimator of $\theta^2$ and find its efficiency.


6.2.12. Recall that $\hat\theta = -n/\sum_{i=1}^{n} \log X_i$ is the mle of $\theta$ for a beta($\theta$, 1) distribution.
Also, $W = -\sum_{i=1}^{n} \log X_i$ has the gamma distribution $\Gamma(n, 1/\theta)$.


(a) Show that $2\theta W$ has a $\chi^2(2n)$ distribution.


(b) Using part (a), find $c_1$ and $c_2$ so that
$$P\left(c_1 < \frac{2\theta n}{\hat\theta} < c_2\right) = 1 - \alpha, \tag{6.2.35}$$
for $0 < \alpha < 1$. Next, obtain a $(1 - \alpha)100\%$ confidence interval for $\theta$.


(c) For a = 0.05 and n = 10, compare the length of this interval with the length 
of the interval found in Example 6.2.6. 


6.2.13. The data file beta30.rda contains 30 observations generated from a beta($\theta$, 1)
distribution, where $\theta = 4$. The file can be downloaded at the site discussed in the
Preface.

(a) Obtain a histogram of the data using the argument pr=T. Overlay the pdf of
a $\beta(4, 1)$ pdf. Comment.


(b) Using the results of Exercise 6.2.12, compute the maximum likelihood estimate
based on the data.


(c) Using the confidence interval found in Part (c) of Exercise 6.2.12, compute
the 95% confidence interval for $\theta$ based on the data. Is the confidence interval
successful?


6.2.14. Consider sampling on the random variable X with the pdf given in Exercise 
6.2.9. 


(a) Obtain the corresponding cdf and its inverse. Show how to generate observa- 
tions from this distribution. 


(b) Write an R function that generates a sample on X. 


(c) Generate a sample of size 50 and compute the unbiased estimate of $\theta$ discussed
in Exercise 6.2.9. Use it and the Central Limit Theorem to compute a 95%
confidence interval for $\theta$.


6.2.15. By using expressions (6.2.21) and (6.2.22), obtain the result for the one-step 
estimate discussed at the end of this section. 


6.2.16. Let $S^2$ be the sample variance of a random sample of size $n > 1$ from
$N(\mu, \theta)$, $0 < \theta < \infty$, where $\mu$ is known. We know $E(S^2) = \theta$.


(a) What is the efficiency of $S^2$?
(b) Under these conditions, what is the mle $\hat\theta$ of $\theta$?


(c) What is the asymptotic distribution of $\sqrt{n}(\hat\theta - \theta)$?


6.3. Maximum Likelihood Tests 


In the last section, we presented an inference for pointwise estimation and confidence 
intervals based on likelihood theory. In this section, we present a corresponding 
inference for testing hypotheses. 

As in the last section, let $X_1, \ldots, X_n$ be iid with pdf $f(x;\theta)$ for $\theta \in \Omega$. In this
section, $\theta$ is a scalar, but in Sections 6.4 and 6.5 extensions to the vector-valued
case are discussed. Consider the two-sided hypotheses
$$H_0: \theta = \theta_0 \text{ versus } H_1: \theta \ne \theta_0, \tag{6.3.1}$$
where $\theta_0$ is a specified value.

Recall that the likelihood function and its log are given by
$$L(\theta) = \prod_{i=1}^{n} f(X_i;\theta),$$
$$l(\theta) = \sum_{i=1}^{n} \log f(X_i;\theta).$$
Let $\hat\theta$ denote the maximum likelihood estimate of $\theta$.

To motivate the test, consider Theorem 6.1.1, which says that if $\theta_0$ is the true
value of $\theta$, then, asymptotically, $L(\theta_0)$ is the maximum value of $L(\theta)$. Consider the
ratio of two likelihood functions, namely,
$$\Lambda = \frac{L(\theta_0)}{L(\hat\theta)}. \tag{6.3.2}$$
Note that $\Lambda \le 1$, but if $H_0$ is true, $\Lambda$ should be large (close to 1), while if $H_1$ is true,
$\Lambda$ should be smaller. For a specified significance level $\alpha$, this leads to the intuitive
decision rule
$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } \Lambda \le c, \tag{6.3.3}$$
where $c$ is such that $\alpha = P_{\theta_0}[\Lambda \le c]$. We call it the likelihood ratio test (LRT).
Theorem 6.3.1 derives the asymptotic distribution of $\Lambda$ under $H_0$, but first we look
at two examples.


Example 6.3.1 (Likelihood Ratio Test for the Exponential Distribution). Suppose
$X_1, \ldots, X_n$ are iid with pdf $f(x;\theta) = \theta^{-1}\exp\{-x/\theta\}$, for $x, \theta > 0$. Let the
hypotheses be given by (6.3.1). The likelihood function simplifies to
$$L(\theta) = \theta^{-n}\exp\{-(n/\theta)\bar X\}.$$
From Example 4.1.1, the mle of $\theta$ is $\bar X$. After some simplification, the likelihood
ratio test statistic simplifies to
$$\Lambda = e^n \left(\frac{\bar X}{\theta_0}\right)^n \exp\{-n\bar X/\theta_0\}. \tag{6.3.4}$$
The decision rule is to reject $H_0$ if $\Lambda \le c$. But further simplification of the test is
possible. Other than the constant $e^n$, the test statistic is of the form
$$g(t) = t^n \exp\{-nt\}, \quad t > 0,$$
where $t = \bar x/\theta_0$. Using differential calculus, it is easy to show that $g(t)$ has
a unique critical value at 1, i.e., $g'(1) = 0$, and further that $t = 1$ provides a
maximum, because $g''(1) < 0$. As Figure 6.3.1 depicts, $g(t) \le c$ if and only if
$t \le c_1$ or $t \ge c_2$. This leads to
$$\Lambda \le c \text{ if and only if } \frac{\bar X}{\theta_0} \le c_1 \text{ or } \frac{\bar X}{\theta_0} \ge c_2.$$
Note that under the null hypothesis, $H_0$, the statistic $(2/\theta_0)\sum_{i=1}^{n} X_i$ has a $\chi^2$
distribution with $2n$ degrees of freedom. Based on this, the following decision rule
results in a level $\alpha$ test:
$$\text{Reject } H_0 \text{ if } (2/\theta_0)\sum_{i=1}^{n} X_i \le \chi^2_{1-\alpha/2}(2n) \text{ or } (2/\theta_0)\sum_{i=1}^{n} X_i \ge \chi^2_{\alpha/2}(2n), \tag{6.3.5}$$
where $\chi^2_{1-\alpha/2}(2n)$ is the lower $\alpha/2$ quantile of a $\chi^2$ distribution with $2n$ degrees
of freedom and $\chi^2_{\alpha/2}(2n)$ is the upper $\alpha/2$ quantile of a $\chi^2$ distribution with $2n$
degrees of freedom. Other choices of $c_1$ and $c_2$ can be made, but these are usually
the choices used in practice. Exercise 6.3.2 investigates the power curve for this
test. ■


Figure 6.3.1: Plot for Example 6.3.1, showing that the function $g(t) \le c$ if and
only if $t \le c_1$ or $t \ge c_2$.
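
Decision rule (6.3.5) translates directly into R with `qchisq`. In the sketch below, the function name `exp_lrt_exact` and the toy data are hypothetical. Note that in the notation of the text $\chi^2_{1-\alpha/2}(2n)$ is the lower $\alpha/2$ quantile, so it corresponds to `qchisq(alpha/2, 2*n)`.

```r
# Exact level-alpha test (6.3.5) for H0: theta = theta0 in the exponential model.
exp_lrt_exact <- function(x, theta0, alpha = 0.05) {
  n <- length(x)
  stat <- 2 * sum(x) / theta0                     # chi-square(2n) under H0
  lower <- qchisq(alpha / 2, df = 2 * n)          # lower alpha/2 quantile
  upper <- qchisq(1 - alpha / 2, df = 2 * n)      # upper alpha/2 quantile
  list(statistic = stat, lower = lower, upper = upper,
       reject = (stat <= lower) || (stat >= upper))
}

x <- rep(1, 10)                                   # toy data with xbar = 1
res_null <- exp_lrt_exact(x, theta0 = 1)          # statistic = 20, df = 20
res_alt  <- exp_lrt_exact(x, theta0 = 5)          # statistic = 4,  df = 20
c(reject_theta0_1 = res_null$reject, reject_theta0_5 = res_alt$reject)
```

With these toy data the test does not reject $\theta_0 = 1$ but does reject $\theta_0 = 5$, since the statistic 4 falls below the lower critical value.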


Example 6.3.2 (Likelihood Ratio Test for the Mean of a Normal pdf). Consider 
a random sample X1, X2,...,Xn from a N(6,07) distribution where —oo < 6 < 00 
and o? > 0 is known. Consider the hypotheses 

Ho: 0 =o versus Hy: 04 6, 


where 6 is specified. The likelihood function is 


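$$ \begin{aligned} L(\theta) &= \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left\{-(2\sigma^2)^{-1}\sum_{i=1}^n (x_i - \theta)^2\right\} \\ &= \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left\{-(2\sigma^2)^{-1}\sum_{i=1}^n (x_i - \bar{x})^2\right\} \exp\{-(2\sigma^2)^{-1} n(\bar{x} - \theta)^2\}. \end{aligned} $$

Of course, in $\Omega = \{\theta : -\infty < \theta < \infty\}$, the mle is $\hat\theta = \bar{X}$ and thus

$$ \Lambda = \frac{L(\theta_0)}{L(\hat\theta)} = \exp\{-(2\sigma^2)^{-1} n(\bar{X} - \theta_0)^2\}. $$

Then $\Lambda \le c$ is equivalent to $-2\log\Lambda \ge -2\log c$. However,

$$ -2\log\Lambda = \left(\frac{\bar{X} - \theta_0}{\sigma/\sqrt{n}}\right)^2, $$

which has a $\chi^2(1)$ distribution under $H_0$. Thus, the likelihood ratio test with
significance level $\alpha$ states that we reject $H_0$ and accept $H_1$ when

$$ -2\log\Lambda = \left(\frac{\bar{X} - \theta_0}{\sigma/\sqrt{n}}\right)^2 \ge \chi^2_\alpha(1). \tag{6.3.6} $$

Note that this test is the same as the z-test for a normal mean discussed in Chapter
4 with $s$ replaced by $\sigma$. Hence, the power function for this test is given in expression
(4.6.5). ■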
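As a numerical companion to Example 6.3.2, here is a minimal Python sketch (function names and data are hypothetical, and the text itself uses R). It evaluates $-2\log\Lambda$ both as the squared z-statistic and directly from the two sums of squares, and computes the p-value from the identity $P(\chi^2(1) \ge s) = \operatorname{erfc}(\sqrt{s/2})$.

```python
import math

def normal_mean_lrt(xs, theta0, sigma):
    """Return (-2 log Lambda, p-value) for H0: theta = theta0 with sigma known."""
    n = len(xs)
    xbar = sum(xs) / n
    z = (xbar - theta0) / (sigma / math.sqrt(n))
    stat = z * z                                  # the squared z-statistic of (6.3.6)
    p_value = math.erfc(math.sqrt(stat / 2.0))    # upper tail of chi-square(1)
    return stat, p_value

def minus2_log_lambda(xs, theta0, sigma):
    """Direct evaluation of -2[log L(theta0) - log L(xbar)]."""
    xbar = sum(xs) / len(xs)
    ss0 = sum((x - theta0) ** 2 for x in xs)      # sum of squares about theta0
    ss1 = sum((x - xbar) ** 2 for x in xs)        # sum of squares about the mle
    return (ss0 - ss1) / sigma ** 2

data = [0.5, 1.5, 2.5, 3.5]                       # hypothetical sample
print(normal_mean_lrt(data, theta0=1.0, sigma=1.0))
print(minus2_log_lambda(data, theta0=1.0, sigma=1.0))
```

The two routes agree because $\sum (x_i - \theta_0)^2 - \sum (x_i - \bar{x})^2 = n(\bar{x} - \theta_0)^2$.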


Other examples are given in the exercises. In these examples the likelihood ratio 
tests simplify and we are able to get the test in closed form. Often, though, this 
is impossible. In such cases, similarly to Example 6.2.7, we can obtain the mle by 
iterative routines and, hence, also the test statistic $\Lambda$. In Example 6.3.2, $-2\log\Lambda$
had an exact $\chi^2(1)$ null distribution. While not true in general, as the following
theorem shows, under regularity conditions, the asymptotic null distribution of
$-2\log\Lambda$ is $\chi^2$ with one degree of freedom. Hence in all cases an asymptotic test
can be constructed.


Theorem 6.3.1. Assume the same regularity conditions as for Theorem 6.2.2.
Under the null hypothesis, $H_0: \theta = \theta_0$,

$$ -2\log\Lambda \xrightarrow{D} \chi^2(1). \tag{6.3.7} $$


Proof: Expand the function $l(\theta)$ into a Taylor series about $\theta_0$ of order 1 and evaluate
it at the mle, $\hat\theta$. This results in

$$ l(\hat\theta) = l(\theta_0) + (\hat\theta - \theta_0)\, l'(\theta_0) + \tfrac{1}{2}(\hat\theta - \theta_0)^2\, l''(\theta_n^*), \tag{6.3.8} $$

where $\theta_n^*$ is between $\hat\theta$ and $\theta_0$. Because $\hat\theta \xrightarrow{P} \theta_0$, it follows that $\theta_n^* \xrightarrow{P} \theta_0$. This, in
addition to the fact that the function $l''(\theta)$ is continuous, and equation (6.2.22) of
Theorem 6.2.2 imply that

$$ -\frac{1}{n}\, l''(\theta_n^*) \xrightarrow{P} I(\theta_0). \tag{6.3.9} $$

By Corollary 6.2.3,

$$ \frac{1}{\sqrt{n}}\, l'(\theta_0) = \sqrt{n}(\hat\theta - \theta_0) I(\theta_0) + R_n, \tag{6.3.10} $$

where $R_n \to 0$, in probability. If we substitute (6.3.9) and (6.3.10) into expression
(6.3.8) and do some simplification, we have

$$ -2\log\Lambda = 2\bigl(l(\hat\theta) - l(\theta_0)\bigr) = \left\{\sqrt{n I(\theta_0)}\,(\hat\theta - \theta_0)\right\}^2 + R_n^*, \tag{6.3.11} $$

where $R_n^* \to 0$, in probability. By Theorems 5.2.4 and 6.2.2, the first term on the
right side of the above equation converges in distribution to a $\chi^2$-distribution with
one degree of freedom. ■


Define the test statistic $\chi^2_L = -2\log\Lambda$. For the hypotheses (6.3.1), this theorem
suggests the decision rule

$$ \text{Reject } H_0 \text{ in favor of } H_1 \text{ if } \chi^2_L \ge \chi^2_\alpha(1). \tag{6.3.12} $$

By the last theorem, this test has asymptotic level $\alpha$. If we cannot obtain the test
statistic or its distribution in closed form, we can use this asymptotic test.

Besides the likelihood ratio test, in practice two other likelihood-related tests 
are employed. A natural test statistic is based on the asymptotic distribution of $\hat\theta$.
Consider the statistic

$$ \chi^2_W = \left\{\sqrt{n I(\hat\theta)}\,(\hat\theta - \theta_0)\right\}^2. \tag{6.3.13} $$

Because $I(\theta)$ is a continuous function, $I(\hat\theta) \to I(\theta_0)$ in probability under the null
hypothesis, (6.3.1). It follows, under $H_0$, that $\chi^2_W$ has an asymptotic $\chi^2$-distribution
with one degree of freedom. This suggests the decision rule

$$ \text{Reject } H_0 \text{ in favor of } H_1 \text{ if } \chi^2_W \ge \chi^2_\alpha(1). \tag{6.3.14} $$

As with the test based on $\chi^2_L$, this test has asymptotic level $\alpha$. Actually, the
relationship between the two test statistics is strong, because as equation (6.3.11)
shows, under $H_0$,

$$ \chi^2_W - \chi^2_L \xrightarrow{P} 0. \tag{6.3.15} $$


The test (6.3.14) is often referred to as a Wald-type test, after Abraham Wald, 
who was a prominent statistician of the 20th century. 

The third test is called a scores-type test, which is often referred to as Rao's
score test, after another prominent statistician, C. R. Rao. The scores are the
components of the vector

$$ \mathbf{S}(\theta) = \left(\frac{\partial \log f(X_1;\theta)}{\partial\theta}, \ldots, \frac{\partial \log f(X_n;\theta)}{\partial\theta}\right)'. \tag{6.3.16} $$

In our notation, we have

$$ \frac{1}{\sqrt{n}}\, l'(\theta_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{\partial \log f(X_i;\theta_0)}{\partial\theta}. \tag{6.3.17} $$

Define the statistic

$$ \chi^2_R = \left(\frac{l'(\theta_0)}{\sqrt{n I(\theta_0)}}\right)^2. \tag{6.3.18} $$

Under $H_0$, it follows from expression (6.3.10) that

$$ \chi^2_R = \chi^2_W + R_{0n}, \tag{6.3.19} $$

where $R_{0n}$ converges to 0 in probability. Hence the following decision rule defines
an asymptotic level $\alpha$ test under $H_0$:

$$ \text{Reject } H_0 \text{ in favor of } H_1 \text{ if } \chi^2_R \ge \chi^2_\alpha(1). \tag{6.3.20} $$


Example 6.3.3 (Example 6.2.6, Continued). As in Example 6.2.6, let $X_1, \ldots, X_n$
be a random sample having the common beta$(\theta, 1)$ pdf (6.2.14). We use this pdf to
illustrate the three test statistics discussed above for the hypotheses

$$ H_0: \theta = 1 \text{ versus } H_1: \theta \ne 1. \tag{6.3.21} $$

Under $H_0$, $f(x;\theta)$ is the uniform$(0,1)$ pdf. Recall that $\hat\theta = -n/\sum_{i=1}^n \log X_i$ is the
mle of $\theta$. After some simplification, the value of the likelihood function at the mle
is

$$ L(\hat\theta) = \left(-\sum_{i=1}^n \log X_i\right)^{-n} \exp\left\{-\sum_{i=1}^n \log X_i\right\} \exp\{n(\log n - 1)\}. $$

Also, $L(1) = 1$. Hence the likelihood ratio test statistic is $\Lambda = 1/L(\hat\theta)$, so that

$$ \chi^2_L = -2\log\Lambda = 2\left\{-\sum_{i=1}^n \log X_i - n\log\left(-\sum_{i=1}^n \log X_i\right) - n + n\log n\right\}. $$

Recall that the information for this pdf is $I(\theta) = \theta^{-2}$. For the Wald-type test, we
would estimate this consistently by $\hat\theta^{-2}$. The Wald-type test simplifies to

$$ \chi^2_W = \left(\sqrt{\frac{n}{\hat\theta^2}}\,(\hat\theta - 1)\right)^2 = n\left\{1 - \frac{1}{\hat\theta}\right\}^2. \tag{6.3.22} $$

Finally, for the scores-type test, recall from (6.2.15) that $l'(1)$ is

$$ l'(1) = \sum_{i=1}^n \log X_i + n. $$

Hence the scores-type test statistic is

$$ \chi^2_R = \left\{\frac{\sum_{i=1}^n \log X_i + n}{\sqrt{n}}\right\}^2. \tag{6.3.23} $$

It is easy to show that expressions (6.3.22) and (6.3.23) are the same. From Example
6.2.4, we know the exact distribution of the maximum likelihood estimate. Exercise
6.3.8 uses this distribution to obtain an exact test. ■
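The three statistics of Example 6.3.3 are simple enough to compute side by side. The sketch below is a hypothetical Python illustration (the names and the data are assumptions, and the text's own software is R). It also checks numerically that the Wald-type and scores-type statistics coincide for this model and that $\chi^2_L$ equals twice the log-likelihood difference, using $l(\theta) = n\log\theta + (\theta - 1)\sum \log x_i$.

```python
import math

def beta_theta1_tests(xs):
    """mle plus LRT, Wald, and score statistics for H0: theta = 1 in the beta(theta, 1) model."""
    n = len(xs)
    s = sum(math.log(x) for x in xs)          # sum of log X_i, negative for data in (0, 1)
    theta_hat = -n / s                        # mle of theta
    lrt = 2.0 * (-s - n * math.log(-s) - n + n * math.log(n))
    wald = n * (1.0 - 1.0 / theta_hat) ** 2   # expression (6.3.22)
    score = ((s + n) / math.sqrt(n)) ** 2     # expression (6.3.23)
    return theta_hat, lrt, wald, score

def beta_loglik(xs, theta):
    """Log-likelihood of the beta(theta, 1) model."""
    s = sum(math.log(x) for x in xs)
    return len(xs) * math.log(theta) + (theta - 1.0) * s

obs = [0.9, 0.7, 0.95, 0.6, 0.85, 0.8]        # hypothetical observations in (0, 1)
print(beta_theta1_tests(obs))
```

Each statistic would then be compared with the upper $\alpha$ quantile of the $\chi^2(1)$ distribution.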


Example 6.3.4 (Likelihood Tests for the Laplace Location Model). Consider the
location model

$$ X_i = \theta + e_i, \quad i = 1, \ldots, n, $$

where $-\infty < \theta < \infty$ and the random errors $e_i$s are iid each having the Laplace pdf,
(2.2.4). Technically, the Laplace distribution does not satisfy all of the regularity
conditions (R0)-(R5), but the results below can be derived rigorously; see, for
example, Hettmansperger and McKean (2011). Consider testing the hypotheses

$$ H_0: \theta = \theta_0 \text{ versus } H_1: \theta \ne \theta_0, $$


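where $\theta_0$ is specified. Here $\Omega = (-\infty, \infty)$ and $\omega = \{\theta_0\}$. By Example 6.1.1, we
know that the mle of $\theta$ under $\Omega$ is $Q_2 = \text{med}\{X_1, \ldots, X_n\}$, the sample median. It
follows that

$$ L(\hat\Omega) = 2^{-n} \exp\left\{-\sum_{i=1}^n |x_i - Q_2|\right\}, $$

while

$$ L(\hat\omega) = 2^{-n} \exp\left\{-\sum_{i=1}^n |x_i - \theta_0|\right\}. $$

Hence the negative of twice the log of the likelihood ratio test statistic is

$$ -2\log\Lambda = 2\left[\sum_{i=1}^n |x_i - \theta_0| - \sum_{i=1}^n |x_i - Q_2|\right]. \tag{6.3.24} $$

Thus the size $\alpha$ asymptotic likelihood ratio test for $H_0$ versus $H_1$ rejects $H_0$ in favor
of $H_1$ if

$$ 2\left[\sum_{i=1}^n |x_i - \theta_0| - \sum_{i=1}^n |x_i - Q_2|\right] \ge \chi^2_\alpha(1). $$

By (6.2.10), the Fisher information for this model is $I(\theta) = 1$. Thus, the Wald-type
test statistic simplifies to

$$ \chi^2_W = [\sqrt{n}(Q_2 - \theta_0)]^2. $$

For the scores test, we have

$$ \frac{\partial \log f(x_i - \theta)}{\partial\theta} = \frac{\partial}{\partial\theta}\left[\log\frac{1}{2} - |x_i - \theta|\right] = \text{sgn}(x_i - \theta). $$

Hence the score vector for this model is $\mathbf{S}(\theta) = (\text{sgn}(X_1 - \theta), \ldots, \text{sgn}(X_n - \theta))'$.
From the above discussion [see equation (6.3.17)], the scores test statistic can be
written as

$$ \chi^2_R = (S^*)^2/n, $$

where

$$ S^* = \sum_{i=1}^n \text{sgn}(X_i - \theta_0). $$

As Exercise 6.3.5 shows, under $H_0$, $S^*$ is a linear function of a random variable with
a $b(n, 1/2)$ distribution. ■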
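For the Laplace location model, all three statistics depend only on the sample median, sums of absolute deviations, and signs. A minimal Python sketch follows; the function name and the data are hypothetical, and the sample has no observation equal to $\theta_0$.

```python
import statistics

def laplace_location_tests(xs, theta0):
    """LRT, Wald, and score statistics for H0: theta = theta0 under Laplace errors."""
    n = len(xs)
    q2 = statistics.median(xs)                            # mle of theta under the full model
    lrt = 2.0 * (sum(abs(x - theta0) for x in xs)
                 - sum(abs(x - q2) for x in xs))          # expression (6.3.24)
    wald = n * (q2 - theta0) ** 2                         # uses I(theta) = 1
    s_star = sum((x > theta0) - (x < theta0) for x in xs) # sum of sgn(x_i - theta0)
    score = s_star ** 2 / n
    return lrt, wald, score, s_star

values = [1.2, -0.4, 2.5, 0.7, 3.1, 1.9, 0.2]             # hypothetical sample
print(laplace_location_tests(values, theta0=0.0))
```

For this sample the sign sum is $S^* = 6 - 1 = 5$, which matches $2T - n$ with $T = 6$ observations above $\theta_0 = 0$ and $n = 7$.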


Which of the three tests should we use? Based on the above discussion, all three 
tests are asymptotically equivalent under the null hypothesis. Similarly to the
concept of asymptotic relative efficiency (ARE), we can derive an equivalent concept
of efficiency for tests; see Chapter 10 and more advanced books such as
Hettmansperger and McKean (2011). However, all three tests have the same asymptotic
efficiency. Hence, asymptotic theory offers little help in separating the tests. Finite 
sample comparisons have not shown that any of these tests are “best” overall; see 
Chapter 7 of Lehmann (1999) for more discussion. 


EXERCISES 


6.3.1. The following data were generated from an exponential distribution with pdf
$f(x;\theta) = (1/\theta)e^{-x/\theta}$, for $x > 0$, where $\theta = 40$.

(a) Histogram the data and locate $\theta_0 = 50$ on the plot.

(b) Use the test described in Example 6.3.1 to test $H_0: \theta = 50$ versus $H_1: \theta \ne 50$.
Determine the decision at level $\alpha = 0.10$.

19 15 76 23 24 66 27 12 25 7 6 16 51 26 39


6.3.2. Consider the decision rule (6.3.5) derived in Example 6.3.1. Obtain the
distribution of the test statistic under a general alternative and use it to obtain
the power function of the test. Using R, sketch this power curve for the case when
$\theta_0 = 1$, $n = 10$, and $\alpha = 0.05$.


6.3.3. Show that the test with decision rule (6.3.6) is like that of Example 4.6.1
except that here $\sigma^2$ is known.


6.3.4. Obtain an R function that plots the power function discussed at the end of
Example 6.3.2. Run your function for the case when $\theta_0 = 0$, $n = 10$, $\sigma^2 = 1$, and
$\alpha = 0.05$.


6.3.5. Consider Example 6.3.4. 
(a) Show that we can write $S^* = 2T - n$, where $T = \#\{X_i > \theta_0\}$.

(b) Show that the scores test for this model is equivalent to rejecting $H_0$ if $T < c_1$
or $T > c_2$.

(c) Show that under $H_0$, $T$ has the binomial distribution $b(n, 1/2)$; hence,
determine $c_1$ and $c_2$ so that the test has size $\alpha$.

(d) Determine the power function for the test based on $T$ as a function of $\theta$.


6.3.6. Let $X_1, X_2, \ldots, X_n$ be a random sample from a $N(\mu_0, \sigma^2 = \theta)$ distribution,
where $0 < \theta < \infty$ and $\mu_0$ is known. Show that the likelihood ratio test of $H_0: \theta = \theta_0$
versus $H_1: \theta \ne \theta_0$ can be based upon the statistic $W = \sum_{i=1}^n (X_i - \mu_0)^2/\theta_0$.
Determine the null distribution of $W$ and give, explicitly, the rejection rule for a
level $\alpha$ test.


6.3.7. For the test described in Exercise 6.3.6, obtain the distribution of the test
statistic under general alternatives. If computational facilities are available, sketch
this power curve for the case when $\theta_0 = 1$, $n = 10$, $\mu = 0$, and $\alpha = 0.05$.


6.3.8. Using the results of Example 6.2.4, find an exact size $\alpha$ test for the hypotheses
(6.3.21).


6.3.9. Let $X_1, X_2, \ldots, X_n$ be a random sample from a Poisson distribution with
mean $\theta > 0$.

(a) Show that the likelihood ratio test of $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$ is based
upon the statistic $Y = \sum_{i=1}^n X_i$. Obtain the null distribution of $Y$.

(b) For $\theta_0 = 2$ and $n = 5$, find the significance level of the test that rejects $H_0$ if
$Y \le 4$ or $Y \ge 17$.


6.3.10. Let $X_1, X_2, \ldots, X_n$ be a random sample from a Bernoulli $b(1, \theta)$ distribution,
where $0 < \theta < 1$.

(a) Show that the likelihood ratio test of $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$ is based
upon the statistic $Y = \sum_{i=1}^n X_i$. Obtain the null distribution of $Y$.

(b) For $n = 100$ and $\theta_0 = 1/2$, find $c_1$ so that the test that rejects $H_0$ when $Y \le c_1$ or
$Y \ge c_2 = 100 - c_1$ has the approximate significance level of $\alpha = 0.05$. Hint:
Use the Central Limit Theorem.


6.3.11. Let $X_1, X_2, \ldots, X_n$ be a random sample from a $\Gamma(\alpha = 4, \beta = \theta)$ distribution,
where $0 < \theta < \infty$.

(a) Show that the likelihood ratio test of $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$ is based
upon the statistic $W = \sum_{i=1}^n X_i$. Obtain the null distribution of $2W/\theta_0$.

(b) For $\theta_0 = 3$ and $n = 5$, find $c_1$ and $c_2$ so that the test that rejects $H_0$ when
$W \le c_1$ or $W \ge c_2$ has significance level 0.05.


6.3.12. Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution with pdf
$f(x;\theta) = \theta \exp\{-|x|^\theta\}/[2\Gamma(1/\theta)]$, $-\infty < x < \infty$, where $\theta > 0$. Suppose $\Omega =
\{\theta : \theta = 1, 2\}$. Consider the hypotheses $H_0: \theta = 2$ (a normal distribution) versus
$H_1: \theta = 1$ (a double exponential distribution). Show that the likelihood ratio test
can be based on the statistic $W = \sum_{i=1}^n (X_i^2 - |X_i|)$.


6.3.13. Let $X_1, X_2, \ldots, X_n$ be a random sample from the beta distribution with
$\alpha = \beta = \theta$ and $\Omega = \{\theta : \theta = 1, 2\}$. Show that the likelihood ratio test statistic
$\Lambda$ for testing $H_0: \theta = 1$ versus $H_1: \theta = 2$ is a function of the statistic $W =
\sum_{i=1}^n \log X_i + \sum_{i=1}^n \log(1 - X_i)$.


6.3.14. Consider a location model 


$$ X_i = \theta + e_i, \quad i = 1, \ldots, n, \tag{6.3.25} $$

where $e_1, e_2, \ldots, e_n$ are iid with pdf $f(x)$. There is a nice geometric interpretation
for estimating $\theta$. Let $\mathbf{X} = (X_1, \ldots, X_n)'$ and $\mathbf{e} = (e_1, \ldots, e_n)'$ be the vectors of
observations and random error, respectively, and let $\boldsymbol{\mu} = \theta\mathbf{1}$, where $\mathbf{1}$ is a vector
with all components equal to 1. Let $V$ be the subspace of vectors of the form $\boldsymbol{\mu}$;
i.e., $V = \{\mathbf{v} : \mathbf{v} = a\mathbf{1}, \text{ for some } a \in R\}$. Then in vector notation we can write the
model as

$$ \mathbf{X} = \boldsymbol{\mu} + \mathbf{e}, \quad \boldsymbol{\mu} \in V. \tag{6.3.26} $$

Then we can summarize the model by saying, "Except for the random error vector
$\mathbf{e}$, $\mathbf{X}$ would reside in $V$." Hence, it makes sense intuitively to estimate $\boldsymbol{\mu}$ by a vector
in $V$ that is "closest" to $\mathbf{X}$. That is, given a norm $\|\cdot\|$ in $R^n$, choose

$$ \hat{\boldsymbol{\mu}} = \text{Argmin}\, \|\mathbf{X} - \mathbf{v}\|, \quad \mathbf{v} \in V. \tag{6.3.27} $$

(a) If the error pdf is the Laplace, (2.2.4), show that the minimization in (6.3.27)
is equivalent to maximizing the likelihood when the norm is the $l_1$ norm given
by

$$ \|\mathbf{v}\|_1 = \sum_{i=1}^n |v_i|. \tag{6.3.28} $$

(b) If the error pdf is the $N(0, 1)$, show that the minimization in (6.3.27) is
equivalent to maximizing the likelihood when the norm is given by the square of
the $l_2$ norm

$$ \|\mathbf{v}\|_2^2 = \sum_{i=1}^n v_i^2. \tag{6.3.29} $$


6.3.15. Continuing with Exercise 6.3.14, besides estimation there is also a nice 
geometric interpretation for testing. For the model (6.3.26), consider the hypotheses 


$$ H_0: \theta = \theta_0 \text{ versus } H_1: \theta \ne \theta_0, \tag{6.3.30} $$

where $\theta_0$ is specified. Given a norm $\|\cdot\|$ on $R^n$, denote by $d(\mathbf{X}, V)$ the distance
between $\mathbf{X}$ and the subspace $V$; i.e., $d(\mathbf{X}, V) = \|\mathbf{X} - \hat{\boldsymbol{\mu}}\|$, where $\hat{\boldsymbol{\mu}}$ is defined in
equation (6.3.27). If $H_0$ is true, then $\hat{\boldsymbol{\mu}}$ should be close to $\boldsymbol{\mu} = \theta_0\mathbf{1}$ and, hence,
$\|\mathbf{X} - \theta_0\mathbf{1}\|$ should be close to $d(\mathbf{X}, V)$. Denote the difference by

$$ RD = \|\mathbf{X} - \theta_0\mathbf{1}\| - \|\mathbf{X} - \hat{\boldsymbol{\mu}}\|. \tag{6.3.31} $$

Small values of $RD$ indicate that the null hypothesis is true, while large values
indicate $H_1$. So our rejection rule when using $RD$ is

$$ \text{Reject } H_0 \text{ in favor of } H_1 \text{ if } RD > c. \tag{6.3.32} $$

(a) If the error pdf is the Laplace, (6.1.6), show that expression (6.3.31) is
equivalent to the likelihood ratio test when the norm is given by (6.3.28).

(b) If the error pdf is the $N(0, 1)$, show that expression (6.3.31) is equivalent to
the likelihood ratio test when the norm is given by the square of the $l_2$ norm,
(6.3.29).


6.3.16. Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution with pmf
$p(x;\theta) = \theta^x(1 - \theta)^{1-x}$, $x = 0, 1$, where $0 < \theta < 1$. We wish to test $H_0: \theta = 1/3$
versus $H_1: \theta \ne 1/3$.

(a) Find $\Lambda$ and $-2\log\Lambda$.

(b) Determine the Wald-type test.

(c) What is Rao's score statistic?


6.3.17. Let $X_1, X_2, \ldots, X_n$ be a random sample from a Poisson distribution with
mean $\theta > 0$. Consider testing $H_0: \theta = \theta_0$ against $H_1: \theta \ne \theta_0$.

(a) Obtain the Wald-type test of expression (6.3.13).

(b) Write an R function to compute this test statistic.

(c) For $\theta_0 = 23$, compute the test statistic and determine the p-value for the
following data.

27 13 21 24 22 14 17 26 14 22
21 24 19 25 15 25 23 16 20 19


6.3.18. Let $X_1, X_2, \ldots, X_n$ be a random sample from a $\Gamma(\alpha, \beta)$ distribution where
$\alpha$ is known and $\beta > 0$. Determine the likelihood ratio test for $H_0: \beta = \beta_0$ against
$H_1: \beta \ne \beta_0$.


6.3.19. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample from a
uniform distribution on $(0, \theta)$, where $\theta > 0$.

(a) Show that $\Lambda$ for testing $H_0: \theta = \theta_0$ against $H_1: \theta \ne \theta_0$ is $\Lambda = (Y_n/\theta_0)^n$,
$Y_n \le \theta_0$, and $\Lambda = 0$ if $Y_n > \theta_0$.

(b) When $H_0$ is true, show that $-2\log\Lambda$ has an exact $\chi^2(2)$ distribution, not
$\chi^2(1)$. Note that the regularity conditions are not satisfied.


6.4 Multiparameter Case: Estimation 


In this section, we discuss the case where $\boldsymbol\theta$ is a vector of $p$ parameters. There
are analogs to the theorems in the previous sections in which $\theta$ is a scalar, and we
present their results but, for the most part, without proofs. The interested reader
can find additional information in more advanced books; see, for instance, Lehmann
and Casella (1998) and Rao (1973).

Let $X_1, \ldots, X_n$ be iid with common pdf $f(x;\boldsymbol\theta)$, where $\boldsymbol\theta \in \Omega \subset R^p$. As before,
the likelihood function and its log are given by

$$ L(\boldsymbol\theta) = \prod_{i=1}^n f(x_i;\boldsymbol\theta) $$
$$ l(\boldsymbol\theta) = \log L(\boldsymbol\theta) = \sum_{i=1}^n \log f(x_i;\boldsymbol\theta), \tag{6.4.1} $$

for $\boldsymbol\theta \in \Omega$. The theory requires additional regularity conditions, which are listed in
Appendix A, (A.1.1). In keeping with our number scheme in the last three sections,
we have labeled these (R6)-(R9). In this section, when we say "under regularity
conditions," we mean all of the conditions of (6.1.1), (6.2.1), (6.2.2), and (A.1.1)
that are relevant to the argument. The discrete case follows in the same way as the 
continuous case, so in general we state material in terms of the continuous case. 

Note that the proof of Theorem 6.1.1 does not depend on whether the parameter
is a scalar or a vector. Therefore, with probability going to 1, $L(\boldsymbol\theta)$ is maximized at
the true value of $\boldsymbol\theta$. Hence, as an estimate of $\boldsymbol\theta$ we consider the value that maximizes
$L(\boldsymbol\theta)$ or equivalently solves the vector equation $(\partial/\partial\boldsymbol\theta)\, l(\boldsymbol\theta) = \mathbf{0}$. If it exists, this
value is called the maximum likelihood estimator (mle) and we denote it by $\hat{\boldsymbol\theta}$.
Often we are interested in a function of $\boldsymbol\theta$, say, the parameter $\eta = g(\boldsymbol\theta)$. Because the
second part of the proof of Theorem 6.1.2 remains true for $\boldsymbol\theta$ as a vector, $\hat\eta = g(\hat{\boldsymbol\theta})$
is the mle of $\eta$.


Example 6.4.1 (Maximum Likelihood Estimates Under the Normal Model). Suppose
$X_1, \ldots, X_n$ are iid $N(\mu, \sigma^2)$. In this case, $\boldsymbol\theta = (\mu, \sigma^2)'$ and $\Omega$ is the product
space $(-\infty, \infty) \times (0, \infty)$. The log of the likelihood simplifies to

$$ l(\mu, \sigma^2) = -\frac{n}{2}\log 2\pi - n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2. \tag{6.4.2} $$

Taking partial derivatives of (6.4.2) with respect to $\mu$ and $\sigma$ and setting them to 0,
we get the simultaneous equations

$$ \frac{\partial l}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) = 0 $$
$$ \frac{\partial l}{\partial\sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^n (x_i - \mu)^2 = 0. $$

Solving these equations, we obtain $\hat\mu = \bar{X}$ and $\hat\sigma = \sqrt{(1/n)\sum_{i=1}^n (X_i - \bar{X})^2}$ as
solutions. A check of the second partials shows that these maximize $l(\mu, \sigma^2)$, so
these are the mles. Also, by Theorem 6.1.2, $(1/n)\sum_{i=1}^n (X_i - \bar{X})^2$ is the mle of $\sigma^2$.
We know from our discussion in Section 5.1 that these are consistent estimates of
$\mu$ and $\sigma^2$, respectively, that $\hat\mu$ is an unbiased estimate of $\mu$, and that $\hat\sigma^2$ is a biased
estimate of $\sigma^2$ whose bias vanishes as $n \to \infty$. ■


Example 6.4.2 (General Laplace pdf). Let $X_1, X_2, \ldots, X_n$ be a random sample
from the Laplace pdf $f_X(x) = (2b)^{-1}\exp\{-|x - a|/b\}$, $-\infty < x < \infty$, where the
parameters $(a, b)$ are in the space $\Omega = \{(a, b) : -\infty < a < \infty, b > 0\}$. Recall in
Section 6.1 that we looked at the special case where $b = 1$. As we now show, the
mle of $a$ is the sample median, regardless of the value of $b$. The log of the likelihood
function is

$$ l(a, b) = -n\log 2 - n\log b - \sum_{i=1}^n \left|\frac{x_i - a}{b}\right|. $$

The partial of $l(a, b)$ with respect to $a$ is

$$ \frac{\partial l(a, b)}{\partial a} = \frac{1}{b}\sum_{i=1}^n \text{sgn}\left\{\frac{x_i - a}{b}\right\} = \frac{1}{b}\sum_{i=1}^n \text{sgn}\{x_i - a\}, $$

where the second equality follows because $b > 0$. Setting this partial to 0, we obtain
the mle of $a$ to be $Q_2 = \text{med}\{X_1, X_2, \ldots, X_n\}$, just as in Example 6.1.1. Hence the
mle of $a$ is invariant to the parameter $b$. Taking the partial of $l(a, b)$ with respect
to $b$, we obtain

$$ \frac{\partial l(a, b)}{\partial b} = -\frac{n}{b} + \frac{1}{b^2}\sum_{i=1}^n |x_i - a|. $$

Setting to 0 and solving the two equations simultaneously, we obtain, as the mle of
$b$, the statistic

$$ \hat{b} = \frac{1}{n}\sum_{i=1}^n |X_i - Q_2|. \quad ■ $$
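The closed-form mles of Example 6.4.2 can be checked numerically. In the minimal Python sketch below (hypothetical names and data, not the text's R code), the log-likelihood at the sample median and the mean absolute deviation about it is compared with its value at nearby parameter pairs.

```python
import math
import statistics

def laplace_mle(xs):
    """mles of (a, b): the sample median and the mean absolute deviation about it."""
    a_hat = statistics.median(xs)
    b_hat = sum(abs(x - a_hat) for x in xs) / len(xs)
    return a_hat, b_hat

def laplace_loglik(xs, a, b):
    """Log-likelihood l(a, b) of the general Laplace model."""
    n = len(xs)
    return -n * math.log(2.0) - n * math.log(b) - sum(abs(x - a) for x in xs) / b

pts = [1.2, -0.4, 2.5, 0.7, 3.1, 1.9, 0.2]     # hypothetical sample
a_hat, b_hat = laplace_mle(pts)
best = laplace_loglik(pts, a_hat, b_hat)
# Perturb each parameter and confirm the log-likelihood does not increase.
for da in (-0.3, 0.0, 0.3):
    for db in (-0.2, 0.0, 0.2):
        assert laplace_loglik(pts, a_hat + da, b_hat + db) <= best + 1e-12
print(a_hat, b_hat, best)
```

A grid check of this kind does not prove global maximality; it only illustrates that the stated estimates behave as a maximizer in a neighborhood.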
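Recall that the Fisher information in the scalar case was the variance of the
random variable $(\partial/\partial\theta)\log f(X;\theta)$. The analog in the multiparameter case is the
variance-covariance matrix of the gradient of $\log f(X;\boldsymbol\theta)$, that is, the
variance-covariance matrix of the random vector given by

$$ \nabla \log f(X;\boldsymbol\theta) = \left(\frac{\partial \log f(X;\boldsymbol\theta)}{\partial\theta_1}, \ldots, \frac{\partial \log f(X;\boldsymbol\theta)}{\partial\theta_p}\right)'. \tag{6.4.3} $$

Fisher information is then defined by the $p \times p$ matrix

$$ \mathbf{I}(\boldsymbol\theta) = \text{Cov}\bigl(\nabla \log f(X;\boldsymbol\theta)\bigr). \tag{6.4.4} $$

The $(j, k)$th entry of $\mathbf{I}(\boldsymbol\theta)$ is given by

$$ I_{jk} = \text{cov}\left(\frac{\partial}{\partial\theta_j}\log f(X;\boldsymbol\theta), \frac{\partial}{\partial\theta_k}\log f(X;\boldsymbol\theta)\right); \quad j, k = 1, \ldots, p. \tag{6.4.5} $$

As in the scalar case, we can simplify this by using the identity $1 = \int f(x;\boldsymbol\theta)\,dx$.
Under the regularity conditions, as discussed in the second paragraph of this section,
the partial derivative of this identity with respect to $\theta_j$ results in

$$ 0 = \int \frac{\partial}{\partial\theta_j} f(x;\boldsymbol\theta)\,dx = \int \left[\frac{\partial}{\partial\theta_j}\log f(x;\boldsymbol\theta)\right] f(x;\boldsymbol\theta)\,dx = E\left[\frac{\partial}{\partial\theta_j}\log f(X;\boldsymbol\theta)\right]. \tag{6.4.6} $$

Next, on both sides of the first equality above, take the partial derivative with
respect to $\theta_k$. After simplification, this results in

$$ 0 = \int \left(\frac{\partial^2}{\partial\theta_j\,\partial\theta_k}\log f(x;\boldsymbol\theta)\right) f(x;\boldsymbol\theta)\,dx + \int \left(\frac{\partial}{\partial\theta_j}\log f(x;\boldsymbol\theta)\,\frac{\partial}{\partial\theta_k}\log f(x;\boldsymbol\theta)\right) f(x;\boldsymbol\theta)\,dx; $$

that is,

$$ E\left[\frac{\partial}{\partial\theta_j}\log f(X;\boldsymbol\theta)\,\frac{\partial}{\partial\theta_k}\log f(X;\boldsymbol\theta)\right] = -E\left[\frac{\partial^2}{\partial\theta_j\,\partial\theta_k}\log f(X;\boldsymbol\theta)\right]. \tag{6.4.7} $$

Using (6.4.6) and (6.4.7) together, we obtain

$$ I_{jk} = -E\left[\frac{\partial^2}{\partial\theta_j\,\partial\theta_k}\log f(X;\boldsymbol\theta)\right]. \tag{6.4.8} $$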
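Information for a random sample follows in the same way as the scalar case. The
pdf of the sample is the likelihood function $L(\boldsymbol\theta;\mathbf{X})$. Replace $f(X;\boldsymbol\theta)$ by $L(\boldsymbol\theta;\mathbf{X})$
in the vector given in expression (6.4.3). Because $\log L$ is a sum, this results in the
random vector

$$ \nabla \log L(\boldsymbol\theta;\mathbf{X}) = \sum_{i=1}^n \nabla \log f(X_i;\boldsymbol\theta). \tag{6.4.9} $$

Because the summands are iid with common covariance matrix $\mathbf{I}(\boldsymbol\theta)$, we have

$$ \text{Cov}\bigl(\nabla \log L(\boldsymbol\theta;\mathbf{X})\bigr) = n\mathbf{I}(\boldsymbol\theta). \tag{6.4.10} $$

As in the scalar case, the information in a random sample of size $n$ is $n$ times the
information in a sample of size 1.
The diagonal entries of $\mathbf{I}(\boldsymbol\theta)$ are

$$ I_{ii}(\boldsymbol\theta) = \text{Var}\left[\frac{\partial \log f(X;\boldsymbol\theta)}{\partial\theta_i}\right] = -E\left[\frac{\partial^2}{\partial\theta_i^2}\log f(X;\boldsymbol\theta)\right]. $$

This is similar to the case when $\theta$ is a scalar, except now $I_{ii}(\boldsymbol\theta)$ is a function of the
vector $\boldsymbol\theta$. Recall in the scalar case that $(nI(\theta))^{-1}$ was the Rao-Cramér lower bound
for an unbiased estimate of $\theta$. There is an analog to this in the multiparameter case.
In particular, if $Y_j = u_j(X_1, \ldots, X_n)$ is an unbiased estimate of $\theta_j$, then it can be
shown that

$$ \text{Var}(Y_j) \ge \frac{1}{n}\left[\mathbf{I}^{-1}(\boldsymbol\theta)\right]_{jj}; \tag{6.4.11} $$

see, for example, Lehmann (1983). As in the scalar case, we shall call an unbiased
estimate efficient if its variance attains this lower bound.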


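Example 6.4.3 (Information Matrix for the Normal pdf). The log of a $N(\mu, \sigma^2)$
pdf is given by

$$ \log f(x;\mu,\sigma^2) = -\frac{1}{2}\log 2\pi - \log\sigma - \frac{1}{2\sigma^2}(x - \mu)^2. \tag{6.4.12} $$

The first and second partial derivatives are

$$ \begin{aligned} \frac{\partial \log f}{\partial\mu} &= \frac{1}{\sigma^2}(x - \mu) \\ \frac{\partial^2 \log f}{\partial\mu^2} &= -\frac{1}{\sigma^2} \\ \frac{\partial \log f}{\partial\sigma} &= -\frac{1}{\sigma} + \frac{1}{\sigma^3}(x - \mu)^2 \\ \frac{\partial^2 \log f}{\partial\sigma^2} &= \frac{1}{\sigma^2} - \frac{3}{\sigma^4}(x - \mu)^2 \\ \frac{\partial^2 \log f}{\partial\mu\,\partial\sigma} &= -\frac{2}{\sigma^3}(x - \mu). \end{aligned} $$

Upon taking the negative of the expectations of the second partial derivatives, the
information matrix for a normal density is

$$ \mathbf{I}(\mu, \sigma) = \begin{bmatrix} \dfrac{1}{\sigma^2} & 0 \\ 0 & \dfrac{2}{\sigma^2} \end{bmatrix}. \tag{6.4.13} $$

We may want the information matrix for $(\mu, \sigma^2)$. This can be obtained by taking
partial derivatives with respect to $\sigma^2$ instead of $\sigma$; however, in Example 6.4.6,
we obtain it via a transformation. From Example 6.4.1, the maximum likelihood
estimates of $\mu$ and $\sigma^2$ are $\hat\mu = \bar{X}$ and $\hat\sigma^2 = (1/n)\sum_{i=1}^n (X_i - \bar{X})^2$, respectively.
Based on the information matrix, we note that $\bar{X}$ is an efficient estimate of $\mu$ for
finite samples. In Example 6.4.6, we consider the sample variance.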


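Example 6.4.4 (Information Matrix for a Location and Scale Family). Suppose
$X_1, X_2, \ldots, X_n$ is a random sample with common pdf $f_X(x) = b^{-1} f\left(\frac{x - a}{b}\right)$, $-\infty <
x < \infty$, where $(a, b)$ is in the space $\Omega = \{(a, b) : -\infty < a < \infty, b > 0\}$ and $f(z)$ is a
pdf such that $f(z) > 0$ for $-\infty < z < \infty$. As Exercise 6.4.10 shows, we can model
$X_i$ as

$$ X_i = a + b e_i, \tag{6.4.14} $$

where the $e_i$s are iid with pdf $f(z)$. This is called a location and scale model (LASP).
Example 6.4.2 illustrated this model when $f(z)$ had the Laplace pdf. In Exercise
6.4.11, the reader is asked to show that the partial derivatives are

$$ \frac{\partial}{\partial a}\left\{\log\left[\frac{1}{b} f\left(\frac{x - a}{b}\right)\right]\right\} = -\frac{1}{b}\,\frac{f'\left(\frac{x - a}{b}\right)}{f\left(\frac{x - a}{b}\right)} $$
$$ \frac{\partial}{\partial b}\left\{\log\left[\frac{1}{b} f\left(\frac{x - a}{b}\right)\right]\right\} = -\frac{1}{b}\left[1 + \frac{\frac{x - a}{b}\, f'\left(\frac{x - a}{b}\right)}{f\left(\frac{x - a}{b}\right)}\right]. $$

Using (6.4.5) and (6.4.6), we then obtain

$$ I_{11} = \int_{-\infty}^{\infty} \frac{1}{b^2}\left[\frac{f'\left(\frac{x - a}{b}\right)}{f\left(\frac{x - a}{b}\right)}\right]^2 \frac{1}{b} f\left(\frac{x - a}{b}\right) dx. $$

Now make the substitution $z = (x - a)/b$, $dz = (1/b)\,dx$. Then we have

$$ I_{11} = \frac{1}{b^2}\int_{-\infty}^{\infty} \left[\frac{f'(z)}{f(z)}\right]^2 f(z)\,dz; \tag{6.4.15} $$

hence, information on the location parameter $a$ does not depend on $a$. As Exercise
6.4.11 shows, upon making this substitution, the other entries in the information
matrix are

$$ I_{22} = \frac{1}{b^2}\int_{-\infty}^{\infty} \left[1 + \frac{z f'(z)}{f(z)}\right]^2 f(z)\,dz \tag{6.4.16} $$
$$ I_{12} = \frac{1}{b^2}\int_{-\infty}^{\infty} z\left[\frac{f'(z)}{f(z)}\right]^2 f(z)\,dz. \tag{6.4.17} $$

Thus, the information matrix can be written as $(1/b)^2$ times a matrix whose entries
are free of the parameters $a$ and $b$. As Exercise 6.4.12 shows, the off-diagonal entries
of the information matrix are 0 if the pdf $f(z)$ is symmetric about 0. ■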
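The integrals (6.4.15)-(6.4.17) are easy to approximate numerically. The following Python sketch is an illustration under stated assumptions: the function names are hypothetical, a plain trapezoid rule on $[-12, 12]$ stands in for the infinite range, and the error pdf is taken to be standard normal, for which $f'(z)/f(z) = -z$. With $b = \sigma$ the result should be close to $(1/b^2, 0, 2/b^2)$, in line with the diagonal information matrix (6.4.13).

```python
import math

def trapz(func, lo, hi, m=20000):
    """Composite trapezoid rule with m subintervals."""
    h = (hi - lo) / m
    total = 0.5 * (func(lo) + func(hi))
    for i in range(1, m):
        total += func(lo + i * h)
    return total * h

def location_scale_information(f, ratio, b, lo=-12.0, hi=12.0):
    """Approximate (I11, I12, I22) of (6.4.15)-(6.4.17); ratio(z) = f'(z)/f(z)."""
    i11 = trapz(lambda z: ratio(z) ** 2 * f(z), lo, hi)
    i12 = trapz(lambda z: z * ratio(z) ** 2 * f(z), lo, hi)
    i22 = trapz(lambda z: (1.0 + z * ratio(z)) ** 2 * f(z), lo, hi)
    return i11 / b ** 2, i12 / b ** 2, i22 / b ** 2

def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def std_normal_ratio(z):
    return -z                      # f'(z)/f(z) for the standard normal pdf

info = location_scale_information(std_normal_pdf, std_normal_ratio, b=2.0)
print(info)
```

Other error pdfs can be tried by passing a different pair of functions, provided the truncation interval covers essentially all of the probability mass.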


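Example 6.4.5 (Multinomial Distribution). Consider a random trial which can
result in one, and only one, of $k$ outcomes or categories. Let $X_j$ be 1 or 0 depending
on whether the $j$th outcome occurs or does not, for $j = 1, \ldots, k$. Suppose the
probability that outcome $j$ occurs is $p_j$; hence, $\sum_{j=1}^k p_j = 1$. Let $\mathbf{X} = (X_1, \ldots, X_{k-1})'$
and $\mathbf{p} = (p_1, \ldots, p_{k-1})'$. The distribution of $\mathbf{X}$ is multinomial; see Section 3.1.
Recall that the pmf is given by

$$ f(\mathbf{x}, \mathbf{p}) = \left(\prod_{j=1}^{k-1} p_j^{x_j}\right)\left(1 - \sum_{j=1}^{k-1} p_j\right)^{1 - \sum_{j=1}^{k-1} x_j}, \tag{6.4.18} $$

where the parameter space is $\Omega = \{\mathbf{p} : 0 < p_j < 1, j = 1, \ldots, k-1; \sum_{j=1}^{k-1} p_j < 1\}$.

We first obtain the information matrix. The first partial of the log of $f$ with
respect to $p_i$ simplifies to

$$ \frac{\partial \log f}{\partial p_i} = \frac{x_i}{p_i} - \frac{1 - \sum_{j=1}^{k-1} x_j}{1 - \sum_{j=1}^{k-1} p_j}. $$

The second partial derivatives are given by

$$ \frac{\partial^2 \log f}{\partial p_i^2} = -\frac{x_i}{p_i^2} - \frac{1 - \sum_{j=1}^{k-1} x_j}{\left(1 - \sum_{j=1}^{k-1} p_j\right)^2} $$
$$ \frac{\partial^2 \log f}{\partial p_i\,\partial p_h} = -\frac{1 - \sum_{j=1}^{k-1} x_j}{\left(1 - \sum_{j=1}^{k-1} p_j\right)^2}, \quad i \ne h < k. $$

Recall that for this distribution the marginal distribution of $X_j$ is Bernoulli with
mean $p_j$. Recalling that $p_k = 1 - (p_1 + \cdots + p_{k-1})$, the expectations of the negatives
of the second partial derivatives are straightforward and result in the information
matrix

$$ \mathbf{I}(\mathbf{p}) = \begin{bmatrix} \frac{1}{p_1} + \frac{1}{p_k} & \frac{1}{p_k} & \cdots & \frac{1}{p_k} \\ \frac{1}{p_k} & \frac{1}{p_2} + \frac{1}{p_k} & \cdots & \frac{1}{p_k} \\ \vdots & \vdots & & \vdots \\ \frac{1}{p_k} & \frac{1}{p_k} & \cdots & \frac{1}{p_{k-1}} + \frac{1}{p_k} \end{bmatrix}. \tag{6.4.19} $$

This is a patterned matrix with inverse [see page 170 of Graybill (1969)],

$$ \mathbf{I}^{-1}(\mathbf{p}) = \begin{bmatrix} p_1(1 - p_1) & -p_1 p_2 & \cdots & -p_1 p_{k-1} \\ -p_1 p_2 & p_2(1 - p_2) & \cdots & -p_2 p_{k-1} \\ \vdots & \vdots & & \vdots \\ -p_1 p_{k-1} & -p_2 p_{k-1} & \cdots & p_{k-1}(1 - p_{k-1}) \end{bmatrix}. \tag{6.4.20} $$

Next, we obtain the mles for a random sample $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$. The likelihood
function is given by

$$ L(\mathbf{p}) = \prod_{i=1}^n \left[\prod_{j=1}^{k-1} p_j^{x_{ji}}\right]\left(1 - \sum_{j=1}^{k-1} p_j\right)^{1 - \sum_{j=1}^{k-1} x_{ji}}. \tag{6.4.21} $$

Let $t_j = \sum_{i=1}^n x_{ji}$, for $j = 1, \ldots, k-1$. With simplification, the log of $L$ reduces to

$$ l(\mathbf{p}) = \sum_{j=1}^{k-1} t_j \log p_j + \left(n - \sum_{j=1}^{k-1} t_j\right)\log\left(1 - \sum_{j=1}^{k-1} p_j\right). $$

The first partial of $l(\mathbf{p})$ with respect to $p_h$ leads to the system of equations

$$ \frac{\partial l(\mathbf{p})}{\partial p_h} = \frac{t_h}{p_h} - \frac{n - \sum_{j=1}^{k-1} t_j}{1 - \sum_{j=1}^{k-1} p_j} = 0, \quad h = 1, \ldots, k-1. $$

It is easily seen that $p_h = t_h/n$ satisfies these equations. Hence the maximum
likelihood estimates are

$$ \hat{p}_h = \frac{\sum_{i=1}^n X_{hi}}{n}, \quad h = 1, \ldots, k-1. \tag{6.4.22} $$

Each random variable $\sum_{i=1}^n X_{hi}$ is binomial$(n, p_h)$ with variance $np_h(1 - p_h)$.
Therefore, the maximum likelihood estimates are efficient estimates. ■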
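The patterned inverse in Example 6.4.5 can be verified numerically for any particular probability vector. The Python sketch below uses hypothetical names and values: it builds the information matrix and its stated inverse for the first $k - 1$ probabilities, multiplies them, and also computes the mles as sample proportions from category counts.

```python
def multinomial_information(p):
    """Information matrix (6.4.19) for p = (p_1, ..., p_{k-1})."""
    pk = 1.0 - sum(p)
    m = len(p)
    return [[(1.0 / p[i] if i == j else 0.0) + 1.0 / pk for j in range(m)]
            for i in range(m)]

def multinomial_information_inverse(p):
    """Stated inverse (6.4.20): p_i(1 - p_i) on the diagonal, -p_i p_j off it."""
    m = len(p)
    return [[p[i] * (1.0 - p[i]) if i == j else -p[i] * p[j] for j in range(m)]
            for i in range(m)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][r] * b[r][j] for r in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def multinomial_mle(counts):
    """Sample proportions t_h / n from the counts of all k categories."""
    n = sum(counts)
    return [t / n for t in counts]

probs = [0.2, 0.3, 0.1]                      # hypothetical p_1, p_2, p_3 (so p_4 = 0.4)
product = matmul(multinomial_information(probs), multinomial_information_inverse(probs))
print(product)                               # should be close to the 3 x 3 identity
print(multinomial_mle([3, 5, 2]))
```

The diagonal of the inverse, divided by $n$, gives $p_h(1 - p_h)/n$, which is the variance of the sample proportion; this is the sense in which the mles attain the bound (6.4.11).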


As a final note on information, suppose the information matrix is diagonal. Then
the lower bound of the variance of the $j$th estimator (6.4.11) is $1/(nI_{jj}(\boldsymbol\theta))$. Because
$I_{jj}(\boldsymbol\theta)$ is defined in terms of partial derivatives [see (6.4.5)] this is the information in
treating all $\theta_i$, except $\theta_j$, as known. For instance, in Example 6.4.3, for the normal
pdf the information matrix is diagonal; hence, the information for $\mu$ could have
been obtained by treating $\sigma^2$ as known. Example 6.4.4 discusses the information
for a general location and scale family. For this general family, of which the normal
is a member, the information matrix is a diagonal matrix if the underlying pdf is
symmetric.

In the next theorem, we summarize the asymptotic behavior of the maximum
likelihood estimator of the vector $\boldsymbol\theta$. It shows that the mles are asymptotically
efficient estimates.


Theorem 6.4.1. Let $X_1, \ldots, X_n$ be iid with pdf $f(x;\boldsymbol\theta)$ for $\boldsymbol\theta \in \Omega$. Assume the
regularity conditions hold. Then

1. The likelihood equation,

$$ \frac{\partial}{\partial\boldsymbol\theta}\, l(\boldsymbol\theta) = \mathbf{0}, $$

has a solution $\hat{\boldsymbol\theta}_n$ such that $\hat{\boldsymbol\theta}_n \xrightarrow{P} \boldsymbol\theta$.

2. For any sequence that satisfies (1),

$$ \sqrt{n}(\hat{\boldsymbol\theta}_n - \boldsymbol\theta) \xrightarrow{D} N_p(\mathbf{0}, \mathbf{I}^{-1}(\boldsymbol\theta)). $$

The proof of this theorem can be found in more advanced books; see, for example,
Lehmann and Casella (1998). As in the scalar case, the theorem does not assure that
the maximum likelihood estimates are unique. But if the sequence of solutions is
unique, then the solutions are both consistent and asymptotically normal. In applications,
we can often verify uniqueness.

We immediately have the following corollary.


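Corollary 6.4.1. Let $X_1, \ldots, X_n$ be iid with pdf $f(x;\boldsymbol\theta)$ for $\boldsymbol\theta \in \Omega$. Assume the
regularity conditions hold. Let $\hat{\boldsymbol\theta}_n$ be a sequence of consistent solutions of the likelihood
equation. Then $\hat{\boldsymbol\theta}_n$ are asymptotically efficient estimates; that is, for $j = 1, \ldots, p$,

$$ \sqrt{n}(\hat\theta_{n,j} - \theta_j) \xrightarrow{D} N(0, [\mathbf{I}^{-1}(\boldsymbol\theta)]_{jj}). $$

Let $\mathbf{g}$ be a transformation $\mathbf{g}(\boldsymbol\theta) = (g_1(\boldsymbol\theta), \ldots, g_k(\boldsymbol\theta))'$ such that $1 \le k \le p$ and
that the $k \times p$ matrix of partial derivatives

$$ \mathbf{B} = \left[\frac{\partial g_i}{\partial\theta_j}\right], \quad i = 1, \ldots, k; \; j = 1, \ldots, p, $$

has continuous elements and does not vanish in a neighborhood of $\boldsymbol\theta$. Let $\hat{\boldsymbol\eta} = \mathbf{g}(\hat{\boldsymbol\theta})$.
Then $\hat{\boldsymbol\eta}$ is the mle of $\boldsymbol\eta = \mathbf{g}(\boldsymbol\theta)$. By Theorem 5.4.6,

$$ \sqrt{n}(\hat{\boldsymbol\eta} - \boldsymbol\eta) \xrightarrow{D} N_k(\mathbf{0}, \mathbf{B}\mathbf{I}^{-1}(\boldsymbol\theta)\mathbf{B}'). \tag{6.4.23} $$

Hence the information matrix for $\sqrt{n}(\hat{\boldsymbol\eta} - \boldsymbol\eta)$ is

$$ \mathbf{I}(\boldsymbol\eta) = \left[\mathbf{B}\mathbf{I}^{-1}(\boldsymbol\theta)\mathbf{B}'\right]^{-1}, \tag{6.4.24} $$

provided that the inverse exists.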
For a simple example of this result, reconsider Example 6.4.3. 


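Example 6.4.6 (Information for the Variance of a Normal Distribution). Suppose
$X_1, \ldots, X_n$ are iid $N(\mu, \sigma^2)$. Recall from Example 6.4.3 that the information matrix
was $\mathbf{I}(\mu, \sigma) = \text{diag}\{\sigma^{-2}, 2\sigma^{-2}\}$. Consider the transformation $g(\mu, \sigma) = \sigma^2$. Hence
the matrix of partials $\mathbf{B}$ is the row vector $[0 \;\; 2\sigma]$. Thus the information for $\sigma^2$ is

$$ I(\sigma^2) = \left\{[0 \;\; 2\sigma]\,\text{diag}\left\{\sigma^2, \frac{\sigma^2}{2}\right\}[0 \;\; 2\sigma]'\right\}^{-1} = \frac{1}{2\sigma^4}. $$

The Rao-Cramér lower bound for the variance of an estimator of $\sigma^2$ is $(2\sigma^4)/n$.
Recall that the sample variance is unbiased for $\sigma^2$, but its variance is $(2\sigma^4)/(n-1)$.
Hence, it is not efficient for finite samples, but it is asymptotically efficient. ■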
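The transformation formula (6.4.24) reduces to a small amount of arithmetic in this example. The Python sketch below (hypothetical function names) evaluates $\mathbf{B}\mathbf{I}^{-1}\mathbf{B}'$ for $\mathbf{B} = [0 \;\; 2\sigma]$ and compares the resulting lower bound $2\sigma^4/n$ with the variance $2\sigma^4/(n-1)$ of the sample variance; the ratio $(n-1)/n$ tends to 1 as $n$ grows.

```python
def info_sigma_squared(sigma):
    """Information for sigma^2 via (6.4.24) with B = [0, 2*sigma]."""
    inv_info = [[sigma ** 2, 0.0],
                [0.0, sigma ** 2 / 2.0]]      # inverse of diag{1/sigma^2, 2/sigma^2}
    b = [0.0, 2.0 * sigma]
    bib = sum(b[i] * inv_info[i][j] * b[j] for i in range(2) for j in range(2))
    return 1.0 / bib                          # equals 1 / (2 sigma^4)

def variance_bounds(sigma, n):
    """Return (lower bound, variance of the sample variance, their ratio)."""
    bound = 1.0 / (n * info_sigma_squared(sigma))
    var_s2 = 2.0 * sigma ** 4 / (n - 1)
    return bound, var_s2, bound / var_s2

print(info_sigma_squared(2.0))
print(variance_bounds(2.0, 10))
```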
EXERCISES


6.4.1. A survey is taken of the citizens in a city as to whether or not they
support the zoning plan that the city council is considering. The responses are: Yes,
No, Indifferent, and Otherwise. Let $p_1$, $p_2$, $p_3$, and $p_4$ denote the respective true
probabilities of these responses. The results of the survey are:


Tafel ms 
(a) Obtain the mles of $p_i$, $i = 1, \ldots, 4$.
(b) Obtain 95% confidence intervals, (4.2.7), for $p_i$, $i = 1, \ldots, 4$.


6.4.2. Let $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ be independent random samples from
$N(\theta_1, \theta_3)$ and $N(\theta_2, \theta_4)$ distributions, respectively.

(a) If $\Omega \subset R^3$ is defined by

$$ \Omega = \{(\theta_1, \theta_2, \theta_3) : -\infty < \theta_i < \infty, i = 1, 2; \; 0 < \theta_3 = \theta_4 < \infty\}, $$

find the mles of $\theta_1$, $\theta_2$, and $\theta_3$.

(b) If $\Omega \subset R^2$ is defined by

$$ \Omega = \{(\theta_1, \theta_3) : -\infty < \theta_1 = \theta_2 < \infty; \; 0 < \theta_3 = \theta_4 < \infty\}, $$

find the mles of $\theta_1$ and $\theta_3$.


6.4.3. Let $X_1, X_2, \ldots, X_n$ be iid, each with the distribution having pdf $f(x;\theta_1,\theta_2) =
(1/\theta_2)e^{-(x-\theta_1)/\theta_2}$, $\theta_1 \le x < \infty$, $-\infty < \theta_2 < \infty$, zero elsewhere. Find the maximum
likelihood estimators of $\theta_1$ and $\theta_2$.


6.4.4. The Pareto distribution is a frequently used model in the study of incomes
and has the distribution function

$$ F(x;\theta_1,\theta_2) = \begin{cases} 1 - (\theta_1/x)^{\theta_2} & \theta_1 \le x \\ 0 & \text{elsewhere,} \end{cases} $$

where $\theta_1 > 0$ and $\theta_2 > 0$. If $X_1, X_2, \ldots, X_n$ is a random sample from this
distribution, find the maximum likelihood estimators of $\theta_1$ and $\theta_2$. (Hint: This exercise
deals with a nonregular case.)


6.4.5. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample of
size $n$ from the uniform distribution of the continuous type over the closed interval
$[\theta - \rho, \theta + \rho]$. Find the maximum likelihood estimators for $\theta$ and $\rho$. Are these two
unbiased estimators?


6.4.6. Let $X_1, X_2, \ldots, X_n$ be a random sample from $N(\mu, \sigma^2)$.

(a) If the constant $b$ is defined by the equation $P(X \le b) = 0.90$, find the mle of
$b$.

(b) If $c$ is a given constant, find the mle of $P(X \le c)$.


6.4.7. The data file normal50.rda contains a random sample of size $n = 50$ for
the situation described in Exercise 6.4.6. Download this data in R and obtain a
histogram of the observations.

(a) In Part (b) of Exercise 6.4.6, let $c = 58$ and let $\xi = P(X \le c)$. Based on the
data, compute the estimated value of the mle for $\xi$. Compare this estimate
with the sample proportion, $\hat{p}$, of the data less than or equal to 58.

(b) The R function bootstrapcis64.R computes a bootstrap confidence interval
for the mle. Use this function to compute a 95% confidence interval for $\xi$.
Compare your interval with that of expression (4.2.7) based on $\hat{p}$.


6.4.8. Consider Part (a) of Exercise 6.4.6.

(a) Using the data of Exercise 6.4.7, compute the mle of $b$. Also obtain the
estimate based on the 90th percentile of the data.

(b) Edit the R function bootstrapcis64.R to compute a bootstrap confidence
interval for $b$. Then run your R function on the data of Exercise 6.4.7 to
compute a 95% confidence interval for $b$.


6.4.9. Consider two Bernoulli distributions with unknown parameters $p_1$ and $p_2$. If
$Y$ and $Z$ equal the numbers of successes in two independent random samples, each
of size $n$, from the respective distributions, determine the mles of $p_1$ and $p_2$ if we
know that $0 \le p_1 \le p_2 \le 1$.


6.4.10. Show that if $X_i$ follows the model (6.4.14), then its pdf is $b^{-1} f((x - a)/b)$.


6.4.11. Verify the partial derivatives and the entries of the information matrix for 
the location and scale family as given in Example 6.4.4. 


6.4.12. Suppose the pdf of $X$ is of a location and scale family as defined in Example
6.4.4. Show that if $f(z) = f(-z)$, then the entry $I_{12}$ of the information matrix is 0.
Then argue that in this case the mles of $a$ and $b$ are asymptotically independent.


6.4.13. Suppose $X_1, X_2, \ldots, X_n$ are iid $N(\mu, \sigma^2)$. Show that $X_i$ follows a location
and scale family as given in Example 6.4.4. Obtain the entries of the information
matrix as given in this example and show that they agree with the information
matrix determined in Example 6.4.3.


6.5 Multiparameter Case: Testing 


In the multiparameter case, hypotheses of interest often specify $\theta$ to be in a
subregion of the space. For example, suppose $X$ has a $N(\mu, \sigma^2)$ distribution. The full
space is $\Omega = \{(\mu, \sigma^2) : \sigma^2 > 0, -\infty < \mu < \infty\}$. This is a two-dimensional space.
We may be interested though in testing that $\mu = \mu_0$, where $\mu_0$ is a specified value.
Here we are not concerned about the parameter $\sigma^2$. Under $H_0$, the parameter space
is the one-dimensional space $\omega = \{(\mu_0, \sigma^2) : \sigma^2 > 0\}$. We say that $H_0$ is defined
in terms of one constraint on the space $\Omega$.

In general, let $X_1, \ldots, X_n$ be iid with pdf $f(x; \theta)$ for $\theta \in \Omega \subset R^p$. As in the last
section, we assume that the regularity conditions listed in (6.1.1), (6.2.1), (6.2.2),
and (A.1.1) are satisfied. In this section, we invoke these by the phrase under
regularity conditions. The hypotheses of interest are


$$H_0 : \theta \in \omega \ \text{ versus } \ H_1 : \theta \in \Omega \cap \omega^c, \tag{6.5.1}$$


where $\omega \subset \Omega$ is defined in terms of $q$, $0 < q \le p$, independent constraints of
the form $g_1(\theta) = a_1, \ldots, g_q(\theta) = a_q$. The functions $g_1, \ldots, g_q$ must be continuously
differentiable. This implies that $\omega$ is a $(p - q)$-dimensional space. Based on Theorem
6.1.1, the true parameter maximizes the likelihood function, so an intuitive test
statistic is given by the likelihood ratio

$$\Lambda = \frac{\max_{\theta \in \omega} L(\theta)}{\max_{\theta \in \Omega} L(\theta)}. \tag{6.5.2}$$

Large values (close to 1) of $\Lambda$ suggest that $H_0$ is true, while small values indicate
$H_1$ is true. For a specified level $\alpha$, $0 < \alpha < 1$, this suggests the decision rule


$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } \Lambda \le c, \tag{6.5.3}$$


where $c$ is such that $\alpha = \max_{\theta \in \omega} P_\theta[\Lambda \le c]$. As in the scalar case, this test often
has optimal properties; see Section 6.3. To determine $c$, we need to determine the
distribution of $\Lambda$ or a function of $\Lambda$ when $H_0$ is true.

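Let $\hat{\theta}$ denote the maximum likelihood estimator when the parameter space is
the full space $\Omega$ and let $\hat{\theta}_0$ denote the maximum likelihood estimator when the
parameter space is the reduced space $\omega$. For convenience, define $L(\hat{\Omega}) = L(\hat{\theta})$ and
$L(\hat{\omega}) = L(\hat{\theta}_0)$. Then we can write the likelihood ratio test (LRT) statistic as

$$\Lambda = \frac{L(\hat{\omega})}{L(\hat{\Omega})}. \tag{6.5.4}$$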


Example 6.5.1 (LRT for the Mean of a Normal pdf). Let $X_1, \ldots, X_n$ be a random
sample from a normal distribution with mean $\mu$ and variance $\sigma^2$. Suppose we are
interested in testing

$$H_0 : \mu = \mu_0 \ \text{ versus } \ H_1 : \mu \ne \mu_0, \tag{6.5.5}$$

where $\mu_0$ is specified. Let $\Omega = \{(\mu, \sigma^2) : -\infty < \mu < \infty, \sigma^2 > 0\}$ denote the full
model parameter space. The reduced model parameter space is the one-dimensional
subspace $\omega = \{(\mu_0, \sigma^2) : \sigma^2 > 0\}$. By Example 6.4.1, the mles of $\mu$ and $\sigma^2$ under
$\Omega$ are $\hat{\mu} = \bar{X}$ and $\hat{\sigma}^2 = (1/n) \sum_{i=1}^n (X_i - \bar{X})^2$, respectively. Under $\Omega$, the maximum
value of the likelihood function is

$$L(\hat{\Omega}) = \frac{1}{(2\pi)^{n/2}} \frac{1}{(\hat{\sigma}^2)^{n/2}} \exp\{-(n/2)\}. \tag{6.5.6}$$

Following Example 6.4.1, it is easy to show that under the reduced parameter space
$\omega$, $\hat{\sigma}_0^2 = (1/n) \sum_{i=1}^n (X_i - \mu_0)^2$. Thus the maximum value of the likelihood function
under $\omega$ is

$$L(\hat{\omega}) = \frac{1}{(2\pi)^{n/2}} \frac{1}{(\hat{\sigma}_0^2)^{n/2}} \exp\{-(n/2)\}. \tag{6.5.7}$$

The likelihood ratio test statistic is the ratio of $L(\hat{\omega})$ to $L(\hat{\Omega})$; i.e.,

$$\Lambda = \left( \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{\sum_{i=1}^n (X_i - \mu_0)^2} \right)^{n/2}. \tag{6.5.8}$$

The likelihood ratio test rejects $H_0$ if $\Lambda \le c$, but this is equivalent to rejecting $H_0$
if $\Lambda^{-2/n} \ge c'$. Next, consider the identity


$$\sum_{i=1}^n (X_i - \mu_0)^2 = \sum_{i=1}^n (X_i - \bar{X})^2 + n(\bar{X} - \mu_0)^2. \tag{6.5.9}$$

Substituting (6.5.9) for $\sum_{i=1}^n (X_i - \mu_0)^2$, after simplification, the test becomes reject
$H_0$ if

$$1 + \frac{n(\bar{X} - \mu_0)^2}{\sum_{i=1}^n (X_i - \bar{X})^2} \ge c',$$

or equivalently, reject $H_0$ if

$$\left\{ \frac{\sqrt{n}(\bar{X} - \mu_0)}{\sqrt{\sum_{i=1}^n (X_i - \bar{X})^2/(n-1)}} \right\}^2 \ge c'' = (c' - 1)(n - 1).$$

Let $T$ denote the expression within braces on the left side of this inequality. Then
the decision rule is equivalent to


$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } |T| \ge c^*, \tag{6.5.10}$$


where $\alpha = P_{H_0}[|T| \ge c^*]$. Of course, this is the two-sided version of the t-test
presented in Example 4.5.4. If we take $c^*$ to be $t_{\alpha/2, n-1}$, the upper $\alpha/2$-critical value
of a t-distribution with $n - 1$ degrees of freedom, then our test has exact level $\alpha$.
The power function for this test is discussed in Section 8.3.

As discussed in Example 4.2.1, the R call to compute $t$ is t.test(x,mu=mu0),
where the vector x contains the sample and the scalar mu0 is $\mu_0$. It also computes
the t-confidence interval for $\mu$. ■
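
The equivalence between the likelihood ratio statistic (6.5.8) and the t-statistic can be checked numerically. The following is a minimal sketch in Python (standard library only, whereas the text itself uses R); the function name `lrt_normal_mean` and the sample values are made up for illustration. It computes $\Lambda$ and $T$ and compares $\Lambda^{-2/n}$ with $1 + T^2/(n-1)$, which follows from the identity (6.5.9).

```python
import math

def lrt_normal_mean(x, mu0):
    """Return (Lambda, T) for testing H0: mu = mu0 with normal data."""
    n = len(x)
    xbar = sum(x) / n
    ss_full = sum((xi - xbar) ** 2 for xi in x)   # sum of (x_i - xbar)^2
    ss_null = sum((xi - mu0) ** 2 for xi in x)    # sum of (x_i - mu0)^2
    lam = (ss_full / ss_null) ** (n / 2)          # likelihood ratio (6.5.8)
    s = math.sqrt(ss_full / (n - 1))              # sample standard deviation
    t = math.sqrt(n) * (xbar - mu0) / s           # statistic T of (6.5.10)
    return lam, t

x = [9.8, 10.4, 10.1, 9.5, 10.9, 10.2, 9.9, 10.6]  # hypothetical sample
lam, t = lrt_normal_mean(x, mu0=10.0)
n = len(x)
# The two printed quantities on the right should agree up to roundoff.
print(lam, t, lam ** (-2 / n), 1 + t * t / (n - 1))
```

Because $\Lambda^{-2/n}$ is an increasing function of $T^2$, rejecting for small $\Lambda$ is the same as rejecting for large $|T|$.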


Other examples of likelihood ratio tests for normal distributions can be found 
in the exercises. 

We are not always as fortunate as in Example 6.5.1 to obtain the likelihood
ratio test in a simple form. Often it is difficult or perhaps impossible to obtain its
finite sample distribution. But, as the next theorem shows, we can always obtain
an asymptotic test based on it.

Theorem 6.5.1. Let $X_1, \ldots, X_n$ be iid with pdf $f(x; \theta)$ for $\theta \in \Omega \subset R^p$. Assume
the regularity conditions hold. Let $\hat{\theta}_n$ be a sequence of consistent solutions of the
likelihood equation when the parameter space is the full space $\Omega$. Let $\hat{\theta}_{0,n}$ be a
sequence of consistent solutions of the likelihood equation when the parameter space
is the reduced space $\omega$, which has dimension $p - q$. Let $\Lambda$ denote the likelihood ratio
test statistic given in (6.5.4). Under $H_0$, (6.5.1),


$$-2 \log \Lambda \xrightarrow{D} \chi^2(q). \tag{6.5.11}$$


A proof of this theorem can be found in Rao (1973). 
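
As an informal illustration of (6.5.11), the sketch below (Python, standard library only; all function names are hypothetical) simulates normal samples under $H_0: \mu = 0$, computes $-2\log\Lambda = n \log\{\sum (x_i - \mu_0)^2 / \sum (x_i - \bar{x})^2\}$ from (6.5.8), and records how often the asymptotic $\chi^2(1)$ test rejects at level 0.05. Here $q = 1$ because there is a single constraint. The tail probability of a $\chi^2(1)$ variable is obtained from the identity $P(Z^2 > x) = \mathrm{erfc}(\sqrt{x/2})$ for standard normal $Z$.

```python
import math
import random

def chi2_1_sf(x):
    """P(chi-square(1) > x), using chi-square(1) = Z^2 for standard normal Z."""
    return math.erfc(math.sqrt(x / 2.0))

def neg2_log_lambda(x, mu0):
    """-2 log(Lambda) for H0: mu = mu0 in the normal model of Example 6.5.1."""
    n = len(x)
    xbar = sum(x) / n
    ss_full = sum((xi - xbar) ** 2 for xi in x)
    ss_null = sum((xi - mu0) ** 2 for xi in x)
    # Guard against a tiny negative value caused by roundoff.
    return max(0.0, n * math.log(ss_null / ss_full))

def simulated_level(n=50, reps=2000, alpha=0.05, seed=1):
    """Fraction of simulated H0 samples rejected by the asymptotic test."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]   # H0 is true: mu = 0
        if chi2_1_sf(neg2_log_lambda(x, 0.0)) <= alpha:
            rejections += 1
    return rejections / reps

print(simulated_level())
```

For moderate $n$ the empirical rejection rate should be close to, though typically slightly above, the nominal level, since the chi-square limit is only an approximation to the exact t-based distribution.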

There are analogs of the Wald-type and scores-type tests, also. The Wald-type
test statistic is formulated in terms of the constraints, which define $H_0$, evaluated
at the mle under $\Omega$. We do not formally state it here, but as the following example
shows, it is often a straightforward formulation. The interested reader can find a
discussion of these tests in Lehmann (1999).

A careful reading of the development of this chapter shows that much of it 
remains the same if X is a random vector. The next example demonstrates this. 


Example 6.5.2 (Application of a Multinomial Distribution). As an example, consider
a poll for a presidential race with $k$ candidates. Those polled are asked to
select the person for whom they would vote if the election were held tomorrow.
Assuming that those polled are selected independently of one another and that each
can select one and only one candidate, the multinomial model seems appropriate.
In this problem, suppose we are interested in comparing how the two “leaders” are
doing. In fact, say the null hypothesis of interest is that they are equally favorable.
This can be modeled with a multinomial model that has three categories: (1) and
(2) for the two leading candidates and (3) for all other candidates. Our observation
is a vector $(X_1, X_2)$, where $X_i$ is 1 or 0 depending on whether category $i$ is
selected or not. If both are 0, then category (3) has been selected. Let $p_i$ denote the
probability that category $i$ is selected. Then the pmf of $(X_1, X_2)$ is the trinomial
density,

$$f(x_1, x_2; p_1, p_2) = p_1^{x_1} p_2^{x_2} (1 - p_1 - p_2)^{1 - x_1 - x_2}, \tag{6.5.12}$$

for $x_i = 0, 1$, $i = 1, 2$; $x_1 + x_2 \le 1$, where the parameter space is $\Omega = \{(p_1, p_2) : 0 <
p_i < 1,\ p_1 + p_2 < 1\}$. Suppose $(X_{11}, X_{21}), \ldots, (X_{1n}, X_{2n})$ is a random sample from
this distribution. We shall consider the hypotheses


$$H_0 : p_1 = p_2 \ \text{ versus } \ H_1 : p_1 \ne p_2. \tag{6.5.13}$$


We first derive the likelihood ratio test. Let $T_j = \sum_{i=1}^n X_{ji}$ for $j = 1, 2$. From
Example 6.4.5, we know that the maximum likelihood estimates are $\hat{p}_j = T_j/n$, for
$j = 1, 2$. The value of the likelihood function (6.4.21) at the mles under $\Omega$ is


$$L(\hat{\Omega}) = \hat{p}_1^{\,n\hat{p}_1} \hat{p}_2^{\,n\hat{p}_2} (1 - \hat{p}_1 - \hat{p}_2)^{n(1 - \hat{p}_1 - \hat{p}_2)}.$$


Under the null hypothesis, let $p$ be the common value of $p_1$ and $p_2$. The pmf of
$(X_1, X_2)$ is


$$f(x_1, x_2; p) = p^{x_1 + x_2} (1 - 2p)^{1 - x_1 - x_2}; \quad x_1, x_2 = 0, 1;\ x_1 + x_2 \le 1, \tag{6.5.14}$$

where the parameter space is $\omega = \{p : 0 < p < 1/2\}$. The likelihood under $\omega$ is

$$L(p) = p^{t_1 + t_2} (1 - 2p)^{n - t_1 - t_2}. \tag{6.5.15}$$


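Differentiating $\log L(p)$ with respect to $p$ and setting the derivative to 0 results in
the following maximum likelihood estimate, under $\omega$:


$$\hat{p}_0 = \frac{t_1 + t_2}{2n} = \frac{\hat{p}_1 + \hat{p}_2}{2}, \tag{6.5.16}$$


where $\hat{p}_1$ and $\hat{p}_2$ are the mles under $\Omega$. The likelihood function evaluated at the mle
under $\omega$ simplifies to


$$L(\hat{\omega}) = \left( \frac{\hat{p}_1 + \hat{p}_2}{2} \right)^{n(\hat{p}_1 + \hat{p}_2)} (1 - \hat{p}_1 - \hat{p}_2)^{n(1 - \hat{p}_1 - \hat{p}_2)}. \tag{6.5.17}$$


The reciprocal of the likelihood ratio test statistic then simplifies to


$$\Lambda^{-1} = \left( \frac{2\hat{p}_1}{\hat{p}_1 + \hat{p}_2} \right)^{n\hat{p}_1} \left( \frac{2\hat{p}_2}{\hat{p}_1 + \hat{p}_2} \right)^{n\hat{p}_2}. \tag{6.5.18}$$

Based on Theorem 6.5.1, an asymptotic level $\alpha$ test rejects $H_0$ if $2 \log \Lambda^{-1} > \chi^2_\alpha(1)$.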
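This is an example where the Wald-type test can easily be formulated. The
constraint under $H_0$ is $p_1 - p_2 = 0$. Hence, the Wald-type statistic is $W = \hat{p}_1 - \hat{p}_2$,
which can be expressed as $W = [1, -1][\hat{p}_1; \hat{p}_2]'$. Recall that the information matrix
and its inverse were found for $k$ categories in Example 6.4.5. From Theorem 6.4.1,
we then have


$$\begin{bmatrix} \hat{p}_1 \\ \hat{p}_2 \end{bmatrix} \text{ is approximately } N_2\left( \begin{pmatrix} p_1 \\ p_2 \end{pmatrix}, \frac{1}{n} \begin{bmatrix} p_1(1 - p_1) & -p_1 p_2 \\ -p_1 p_2 & p_2(1 - p_2) \end{bmatrix} \right). \tag{6.5.19}$$


As shown in Example 6.4.5, the finite sample moments are the same as the asymptotic
moments. Hence the variance of $W$ is


$$\mathrm{Var}(W) = [1, -1]\, \frac{1}{n} \begin{bmatrix} p_1(1 - p_1) & -p_1 p_2 \\ -p_1 p_2 & p_2(1 - p_2) \end{bmatrix} \begin{bmatrix} 1 \\ -1 \end{bmatrix} = \frac{p_1 + p_2 - (p_1 - p_2)^2}{n}.$$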


Because $W$ is asymptotically normal, an asymptotic level $\alpha$ test for the hypotheses
(6.5.13) is to reject $H_0$ if $\chi^2_W \ge \chi^2_\alpha(1)$, where


$$\chi^2_W = \frac{(\hat{p}_1 - \hat{p}_2)^2}{(\hat{p}_1 + \hat{p}_2 - (\hat{p}_1 - \hat{p}_2)^2)/n}. \tag{6.5.20}$$

It also follows that an asymptotic $(1 - \alpha)100\%$ confidence interval for the difference
$p_1 - p_2$ is


$$\hat{p}_1 - \hat{p}_2 \pm z_{\alpha/2} \left( \frac{\hat{p}_1 + \hat{p}_2 - (\hat{p}_1 - \hat{p}_2)^2}{n} \right)^{1/2}. \tag{6.5.21}$$

Returning to the polling situation discussed at the beginning of this example, we
would say the race is too close to call if 0 is in this confidence interval.

Equivalently, the test can be based on the test statistic
$z = (\hat{p}_1 - \hat{p}_2)/\sqrt{(\hat{p}_1 + \hat{p}_2 - (\hat{p}_1 - \hat{p}_2)^2)/n}$, the signed square root of $\chi^2_W$, which has
an asymptotic $N(0,1)$ distribution under $H_0$. This form of the test and the
confidence interval for $p_1 - p_2$ are computed by the R function p2pair.R, which can be
downloaded at the site mentioned in the Preface. ■
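
The two tests of this example can be sketched in a few lines of code. The Python function below (standard library only) is an illustrative stand-in, not the p2pair.R function mentioned in the text; the name `trinomial_tests` and the counts used are hypothetical. It takes the category counts $t_1$ and $t_2$ out of $n$, assumes $t_1 + t_2 > 0$, and returns $2\log\Lambda^{-1}$ from (6.5.18), the Wald statistic (6.5.20), and the interval (6.5.21) with $z_{0.025} \approx 1.959964$.

```python
import math

def trinomial_tests(t1, t2, n, z=1.959964):
    """LRT and Wald statistics for H0: p1 = p2, given counts t1, t2 out of n."""
    p1, p2 = t1 / n, t2 / n
    # 2 log(Lambda^{-1}) from (6.5.18); note n * p_hat_j = t_j.
    # A category with a zero count contributes nothing to the sum.
    lrt = 0.0
    for t, p in ((t1, p1), (t2, p2)):
        if t > 0:
            lrt += 2.0 * t * math.log(2.0 * p / (p1 + p2))
    # Estimated variance of W = p1_hat - p2_hat.
    var_w = (p1 + p2 - (p1 - p2) ** 2) / n
    wald = (p1 - p2) ** 2 / var_w                 # statistic (6.5.20)
    half = z * math.sqrt(var_w)                   # half-width of (6.5.21)
    return lrt, wald, (p1 - p2 - half, p1 - p2 + half)

# Hypothetical poll: 210 and 190 votes for the two leaders among 500 polled.
lrt, wald, ci = trinomial_tests(210, 190, 500)
print(lrt, wald, ci)
```

Both statistics are compared with the upper $\alpha$ critical value of a $\chi^2(1)$ distribution (about 3.84 for $\alpha = 0.05$); for counts this close the two values are nearly identical and the interval covers 0.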


Example 6.5.3 (Two-Sample Binomial Proportions). In Example 6.5.2, we developed
tests for $p_1 = p_2$ based on a single sample from a multinomial distribution.
Now consider the situation where $X_1, X_2, \ldots, X_{n_1}$ is a random sample from a $b(1, p_1)$
distribution, $Y_1, Y_2, \ldots, Y_{n_2}$ is a random sample from a $b(1, p_2)$ distribution, and the
$X_i$s and $Y_j$s are mutually independent. The hypotheses of interest are


$$H_0 : p_1 = p_2 \ \text{ versus } \ H_1 : p_1 \ne p_2. \tag{6.5.22}$$


This situation occurs in practice when, for instance, we are comparing the
president's rating from one month to the next. The full and reduced model parameter
spaces are given respectively by $\Omega = \{(p_1, p_2) : 0 < p_i < 1, i = 1, 2\}$ and
$\omega = \{(p, p) : 0 < p < 1\}$. The likelihood function for the full model simplifies to


$$L(p_1, p_2) = p_1^{n_1 \bar{x}} (1 - p_1)^{n_1 - n_1 \bar{x}}\, p_2^{n_2 \bar{y}} (1 - p_2)^{n_2 - n_2 \bar{y}}. \tag{6.5.23}$$


It follows immediately that the mles of $p_1$ and $p_2$ are $\bar{x}$ and $\bar{y}$, respectively. Note,
for the reduced model, that we can combine the samples into one large sample from
a $b(n, p)$ distribution, where $n = n_1 + n_2$ is the combined sample size. Hence, for
the reduced model, the mle of $p$ is


$$\hat{p} = \frac{\sum_{i=1}^{n_1} x_i + \sum_{i=1}^{n_2} y_i}{n_1 + n_2} = \frac{n_1 \bar{x} + n_2 \bar{y}}{n}, \tag{6.5.24}$$

i.e., a weighted average of the individual sample proportions. Using this, the reader
is asked to derive the LRT for the hypotheses (6.5.22) in Exercise 6.5.12. We next
derive the Wald-type test. Let $\hat{p}_1 = \bar{x}$ and $\hat{p}_2 = \bar{y}$. From the Central Limit Theorem,
we have

$$\frac{\sqrt{n_i}(\hat{p}_i - p_i)}{\sqrt{p_i(1 - p_i)}} \xrightarrow{D} Z_i, \quad i = 1, 2,$$

where $Z_1$ and $Z_2$ are iid $N(0,1)$ random variables. Assume for $i = 1, 2$ that, as
$n \to \infty$, $n_i/n \to \lambda_i$, where $0 < \lambda_i < 1$ and $\lambda_1 + \lambda_2 = 1$. As Exercise 6.5.13 shows,


$$\sqrt{n}[(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)] \xrightarrow{D} N\left(0, \frac{1}{\lambda_1} p_1(1 - p_1) + \frac{1}{\lambda_2} p_2(1 - p_2)\right). \tag{6.5.25}$$


It follows that the random variable

$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}} \tag{6.5.26}$$

has an approximate $N(0,1)$ distribution. Under $H_0$, $p_1 - p_2 = 0$. We could use $Z$
as a test statistic, provided we replace the parameters $p_1(1 - p_1)$ and $p_2(1 - p_2)$
in its denominator with consistent estimates. Recall that $\hat{p}_i \to p_i$, $i = 1, 2$, in
probability. Thus under $H_0$, the statistic


$$Z^* = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}} \tag{6.5.27}$$


has an approximate $N(0,1)$ distribution. Hence an approximate level $\alpha$ test is
to reject $H_0$ if $|z^*| \ge z_{\alpha/2}$. Another consistent estimator of the denominator is
discussed in Exercise 6.5.14. ■
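
A minimal Python sketch of the statistic (6.5.27) follows (standard library only; the function name `two_sample_z` and the counts are hypothetical, chosen only for illustration). It assumes each sample proportion lies strictly between 0 and 1 for at least one sample, so that the estimated standard error is positive. The two-sided p-value uses $P(|Z| \ge |z|) = \mathrm{erfc}(|z|/\sqrt{2})$ for standard normal $Z$.

```python
import math

def two_sample_z(y1, n1, y2, n2):
    """Wald-type statistic Z* of (6.5.27) and its two-sided normal p-value.

    y1, y2 are the numbers of successes in samples of sizes n1, n2.
    """
    p1, p2 = y1 / n1, y2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical counts: 45 successes out of 300 versus 60 out of 320.
z, p_value = two_sample_z(45, 300, 60, 320)
print(z, p_value)
```

At level $\alpha$, $H_0$ is rejected when the p-value is at most $\alpha$, which is the same as $|z^*| \ge z_{\alpha/2}$.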


EXERCISES 


6.5.1. On page 80 of their text, Hollander and Wolfe (1999) present measurements
of the ratio of the earth's mass to that of its moon that were made by 7 different
spacecraft (5 of the Mariner type and 2 of the Pioneer type). These measurements
are presented below (also in the file earthmoon.rda). Based on earlier Ranger
voyages, scientists had set this ratio at 81.3035. Assuming a normal distribution,
test the hypotheses $H_0 : \mu = 81.3035$ versus $H_1 : \mu \ne 81.3035$, where $\mu$ is the
true mean ratio of these later voyages. Using the p-value, conclude in terms of the
problem at the nominal $\alpha$-level of 0.05.


Earth to Moon Mass Ratios 
81.3001 | 81.3015 | 81.3006 | 81.3011 | 81.2997 | 81.3005 | 81.3021 


6.5.2. Obtain the boxplot of the data in Exercise 6.5.1. Mark the value 81.3035 on
the plot. Compute the 95% confidence interval for $\mu$, (4.2.3), and mark its endpoints
on the plot. Comment.


6.5.3. Consider the survey of citizens discussed in Exercise 6.4.1. Suppose that the
hypotheses of interest are $H_0 : p_1 = p_2$ versus $H_1 : p_1 \ne p_2$. Note that computation
can be carried out using the R function p2pair.R, which can be downloaded at the
site mentioned in the Preface.


(a) Test these hypotheses at level $\alpha = 0.05$ using the test (6.5.20). Conclude in
terms of the problem.


(b) Obtain the 95% confidence interval, (6.5.21), for $p_1 - p_2$. What does the
confidence interval mean in terms of the problem?


6.5.4. Let $X_1, X_2, \ldots, X_n$ be a random sample from the distribution $N(\theta_1, \theta_2)$.
Show that the likelihood ratio principle for testing $H_0 : \theta_2 = \theta_2'$ specified, and $\theta_1$
unspecified against $H_1 : \theta_2 \ne \theta_2'$, $\theta_1$ unspecified, leads to a test that rejects when
$\sum_{i=1}^n (x_i - \bar{x})^2 \le c_1$ or $\sum_{i=1}^n (x_i - \bar{x})^2 \ge c_2$, where $c_1 < c_2$ are selected appropriately.


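6.5.5. Let $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ be independent random samples from the
distributions $N(\theta_1, \theta_3)$ and $N(\theta_2, \theta_4)$, respectively.

(a) Show that the likelihood ratio for testing $H_0 : \theta_1 = \theta_2$, $\theta_3 = \theta_4$ against all
alternatives is given by

$$\frac{\left[ \sum_{i=1}^n (x_i - \bar{x})^2 / n \right]^{n/2} \left[ \sum_{i=1}^m (y_i - \bar{y})^2 / m \right]^{m/2}}{\left\{ \left[ \sum_{i=1}^n (x_i - u)^2 + \sum_{i=1}^m (y_i - u)^2 \right] / (m + n) \right\}^{(n+m)/2}},$$

where $u = (n\bar{x} + m\bar{y})/(n + m)$.


(b) Show that the likelihood ratio test for testing $H_0 : \theta_3 = \theta_4$, $\theta_1$ and $\theta_2$
unspecified, against $H_1 : \theta_3 \ne \theta_4$, $\theta_1$ and $\theta_2$ unspecified, can be based on the random
variable

$$F = \frac{\sum_{i=1}^n (X_i - \bar{X})^2/(n - 1)}{\sum_{i=1}^m (Y_i - \bar{Y})^2/(m - 1)}.$$
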
6.5.6. Let $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ be independent random samples from
the two normal distributions $N(0, \theta_1)$ and $N(0, \theta_2)$.


(a) Find the likelihood ratio $\Lambda$ for testing the composite hypothesis $H_0 : \theta_1 = \theta_2$
against the composite alternative $H_1 : \theta_1 \ne \theta_2$.


(b) This $\Lambda$ is a function of what F-statistic that would actually be used in this
test?


6.5.7. Let $X$ and $Y$ be two independent random variables with respective pdfs


$$f(x; \theta_i) = \begin{cases} \left( \dfrac{1}{\theta_i} \right) e^{-x/\theta_i} & 0 < x < \infty,\ 0 < \theta_i < \infty \\ 0 & \text{elsewhere,} \end{cases}$$


for $i = 1, 2$. To test $H_0 : \theta_1 = \theta_2$ against $H_1 : \theta_1 \ne \theta_2$, two independent samples
of sizes $n_1$ and $n_2$, respectively, were taken from these distributions. Find the
likelihood ratio $\Lambda$ and show that $\Lambda$ can be written as a function of a statistic having
an F-distribution, under $H_0$.


6.5.8. For a numerical example of the F-test derived in Exercise 6.5.7, here are
two generated data sets. The first was generated by the R call rexp(10,1/20),
i.e., 10 observations from a $\Gamma(1, 20)$-distribution. The second was generated by
rexp(12,1/40). The data are rounded and can also be found in the file genexpd.rda.


(a) Obtain comparison boxplots of the data sets. Comment. 


(b) Carry out the F-test of Exercise 6.5.7. Conclude in terms of the problem at 
level 0.05. 

x? 14.114. FA2 

y: 55.6 40.5 32 


ra 
iS 


.f 56.1 3.3 2.6 


of 9.6 71.61 
.7 25.6 70.6 1.4 51.5 12.6 16.9 63.3 5.6 66.7 


6.5.9. Consider the two uniform distributions with respective pdfs


$$f(x; \theta_i) = \begin{cases} \dfrac{1}{2\theta_i} & -\theta_i < x < \theta_i,\ -\infty < \theta_i < \infty \\ 0 & \text{elsewhere,} \end{cases}$$

for $i = 1, 2$. The null hypothesis is $H_0 : \theta_1 = \theta_2$, while the alternative is $H_1 : \theta_1 \ne \theta_2$.
Let $X_1 < X_2 < \cdots < X_{n_1}$ and $Y_1 < Y_2 < \cdots < Y_{n_2}$ be the order statistics of two
independent random samples from the respective distributions. Using the likelihood
ratio $\Lambda$, find the statistic used to test $H_0$ against $H_1$. Find the distribution of
$-2 \log \Lambda$ when $H_0$ is true. Note that in this nonregular case, the number of degrees
of freedom is two times the difference of the dimensions of $\Omega$ and $\omega$.


6.5.10. Let $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ be a random sample from a bivariate
normal distribution with $\mu_1$, $\mu_2$, $\sigma_1^2 = \sigma_2^2 = \sigma^2$, $\rho = \frac{1}{2}$, where $\mu_1$, $\mu_2$, and $\sigma^2 > 0$ are
unknown real numbers. Find the likelihood ratio $\Lambda$ for testing $H_0 : \mu_1 = \mu_2 = 0$, $\sigma^2$
unknown against all alternatives. The likelihood ratio $\Lambda$ is a function of what
statistic that has a well-known distribution?


6.5.11. Let $n$ independent trials of an experiment be such that $x_1, x_2, \ldots, x_k$ are
the respective numbers of times that the experiment ends in the mutually exclusive
and exhaustive events $C_1, C_2, \ldots, C_k$. If $p_i = P(C_i)$ is constant throughout the $n$
trials, then the probability of that particular sequence of trials is $L = p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}$.


(a) Recalling that $p_1 + p_2 + \cdots + p_k = 1$, show that the likelihood ratio for testing
$H_0 : p_i = p_{i0} > 0$, $i = 1, 2, \ldots, k$, against all alternatives is given by

$$\Lambda = \prod_{i=1}^k \left( \frac{(p_{i0})^{x_i}}{(x_i/n)^{x_i}} \right).$$


(b) Show that

$$-2 \log \Lambda = \sum_{i=1}^k \frac{x_i (x_i - n p_{i0})^2}{(n p_i')^2},$$

where $p_i'$ is between $p_{i0}$ and $x_i/n$.
Hint: Expand $\log p_{i0}$ in a Taylor's series with the remainder in the term
involving $(p_{i0} - x_i/n)^2$.


(c) For large $n$, argue that $x_i/(n p_i')^2$ is approximated by $1/(n p_{i0})$ and hence

$$-2 \log \Lambda \approx \sum_{i=1}^k \frac{(x_i - n p_{i0})^2}{n p_{i0}} \quad \text{when } H_0 \text{ is true.}$$


Theorem 6.5.1 says that the right-hand member of this last equation defines 
a statistic that has an approximate chi-square distribution with k — 1 degrees 
of freedom. Note that 


dimension of $\Omega$ - dimension of $\omega$ $= (k - 1) - 0 = k - 1$.




6.5.12. Finish the derivation of the LRT found in Example 6.5.3. Simplify as much 
as possible. 


6.5.13. Show that expression (6.5.25) of Example 6.5.3 is true. 


6.5.14. As discussed in Example 6.5.3, $Z^*$, (6.5.27), can be used as a test statistic
provided that we have consistent estimators of $p_1(1 - p_1)$ and $p_2(1 - p_2)$ when $H_0$
is true. In the example, we discussed an estimator that is consistent under both $H_0$
and $H_1$. Under $H_0$, though, $p_1(1 - p_1) = p_2(1 - p_2) = p(1 - p)$, where $p = p_1 = p_2$.
Show that the statistic (6.5.24) is a consistent estimator of $p$ under $H_0$. Thus
determine another test of $H_0$.


6.5.15. A machine shop that manufactures toggle levers has both a day and a night
shift. A toggle lever is defective if a standard nut cannot be screwed onto the threads.
Let $p_1$ and $p_2$ be the proportion of defective levers among those manufactured by the
day and night shifts, respectively. We shall test the null hypothesis, $H_0 : p_1 = p_2$,
against a two-sided alternative hypothesis based on two random samples, each of
1000 levers taken from the production of the respective shifts. Use the test statistic
$Z^*$ given in Example 6.5.3.


(a) Sketch a standard normal pdf illustrating the critical region having $\alpha = 0.05$.


(b) If $y_1 = 37$ and $y_2 = 53$ defectives were observed for the day and night shifts,
respectively, calculate the value of the test statistic and the approximate
p-value (note that this is a two-sided test). Locate the calculated test statistic
on your figure in part (a) and state your conclusion. Obtain the approximate
p-value of the test.


6.5.16. For the situation given in part (b) of Exercise 6.5.15, calculate the tests 
defined in Exercises 6.5.12 and 6.5.14. Obtain the approximate p-values of all three 
tests. Discuss the results. 


6.6 The EM Algorithm 


In practice, we are often in the situation where part of the data is missing. For 
example, we may be observing lifetimes of mechanical parts that have been put 
on test and some of these parts are still functioning when the statistical analysis is 
carried out. In this section, we introduce the EM algorithm, which frequently can be 
used in these situations to obtain maximum likelihood estimates. Our presentation 
is brief. For further information, the interested reader can consult the literature in 
this area, including the monograph by McLachlan and Krishnan (1997). Although, 
for convenience, we write in terms of continuous random variables, the theory in 
this section holds for the discrete case as well. 

Suppose we consider a sample of $n$ items, where $n_1$ of the items are observed,
while $n_2 = n - n_1$ items are not observable. Denote the observed items by $\mathbf{X}' =
(X_1, X_2, \ldots, X_{n_1})$ and unobserved items by $\mathbf{Z}' = (Z_1, Z_2, \ldots, Z_{n_2})$. Assume that
the $X_i$s are iid with pdf $f(x|\theta)$, where $\theta \in \Omega$. Assume that the $Z_j$s and the $X_i$s are
mutually independent. The conditional notation will prove useful here. Let $g(\mathbf{x}|\theta)$
denote the joint pdf of $\mathbf{X}$. Let $h(\mathbf{x}, \mathbf{z}|\theta)$ denote the joint pdf of the observed and
unobserved items. Let $k(\mathbf{z}|\theta, \mathbf{x})$ denote the conditional pdf of the missing data given
the observed data. By the definition of a conditional pdf, we have the identity


$$k(\mathbf{z}|\theta, \mathbf{x}) = \frac{h(\mathbf{x}, \mathbf{z}|\theta)}{g(\mathbf{x}|\theta)}. \tag{6.6.1}$$


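The observed likelihood function is $L(\theta|\mathbf{x}) = g(\mathbf{x}|\theta)$. The complete likelihood
function is defined by

$$L^c(\theta|\mathbf{x}, \mathbf{z}) = h(\mathbf{x}, \mathbf{z}|\theta). \tag{6.6.2}$$


Our goal is to maximize the likelihood function $L(\theta|\mathbf{x})$ by using the complete likelihood
$L^c(\theta|\mathbf{x}, \mathbf{z})$ in this process.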
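Using (6.6.1), we derive the following basic identity for an arbitrary but fixed
$\theta_0 \in \Omega$:

$$\begin{aligned}
\log L(\theta|\mathbf{x}) &= \int \log L(\theta|\mathbf{x})\, k(\mathbf{z}|\theta_0, \mathbf{x})\, d\mathbf{z} \\
&= \int \log g(\mathbf{x}|\theta)\, k(\mathbf{z}|\theta_0, \mathbf{x})\, d\mathbf{z} \\
&= \int [\log h(\mathbf{x}, \mathbf{z}|\theta) - \log k(\mathbf{z}|\theta, \mathbf{x})]\, k(\mathbf{z}|\theta_0, \mathbf{x})\, d\mathbf{z} \\
&= \int \log[h(\mathbf{x}, \mathbf{z}|\theta)]\, k(\mathbf{z}|\theta_0, \mathbf{x})\, d\mathbf{z} - \int \log[k(\mathbf{z}|\theta, \mathbf{x})]\, k(\mathbf{z}|\theta_0, \mathbf{x})\, d\mathbf{z} \\
&= E_{\theta_0}[\log L^c(\theta|\mathbf{x}, \mathbf{Z})|\theta_0, \mathbf{x}] - E_{\theta_0}[\log k(\mathbf{Z}|\theta, \mathbf{x})|\theta_0, \mathbf{x}],
\end{aligned} \tag{6.6.3}$$
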


where the expectations are taken under the conditional pdf $k(\mathbf{z}|\theta_0, \mathbf{x})$. Define the
first term on the right side of (6.6.3) to be the function


$$Q(\theta|\theta_0, \mathbf{x}) = E_{\theta_0}[\log L^c(\theta|\mathbf{x}, \mathbf{Z})|\theta_0, \mathbf{x}]. \tag{6.6.4}$$


The expectation that defines the function $Q$ is called the E step of the EM algorithm.
Recall that we want to maximize $\log L(\theta|\mathbf{x})$. As discussed below, we need only
maximize $Q(\theta|\theta_0, \mathbf{x})$. This maximization is called the M step of the EM algorithm.

Denote by $\hat{\theta}^{(0)}$ an initial estimate of $\theta$, perhaps based on the observed likelihood.
Let $\hat{\theta}^{(1)}$ be the argument that maximizes $Q(\theta|\hat{\theta}^{(0)}, \mathbf{x})$. This is the first-step estimate
of $\theta$. Proceeding this way, we obtain a sequence of estimates $\hat{\theta}^{(m)}$. We formally
define this algorithm as follows:


Algorithm 6.6.1 (EM Algorithm). Let $\hat{\theta}^{(m)}$ denote the estimate on the $m$th step.
To compute the estimate on the $(m + 1)$st step, do


1. Expectation Step: Compute

$$Q(\theta|\hat{\theta}^{(m)}, \mathbf{x}) = E_{\hat{\theta}^{(m)}}[\log L^c(\theta|\mathbf{x}, \mathbf{Z})|\hat{\theta}^{(m)}, \mathbf{x}], \tag{6.6.5}$$

where the expectation is taken under the conditional pdf $k(\mathbf{z}|\hat{\theta}^{(m)}, \mathbf{x})$.

2. Maximization Step: Let


$$\hat{\theta}^{(m+1)} = \operatorname{Argmax}\, Q(\theta|\hat{\theta}^{(m)}, \mathbf{x}). \tag{6.6.6}$$


Under strong assumptions, it can be shown that $\hat{\theta}^{(m)}$ converges in probability
to the maximum likelihood estimate, as $m \to \infty$. We will not show these results,
but as the next theorem shows, $\hat{\theta}^{(m+1)}$ always increases the likelihood over $\hat{\theta}^{(m)}$.


Theorem 6.6.1. The sequence of estimates $\hat{\theta}^{(m)}$ defined by Algorithm 6.6.1
satisfies

$$L(\hat{\theta}^{(m+1)}|\mathbf{x}) \ge L(\hat{\theta}^{(m)}|\mathbf{x}). \tag{6.6.7}$$
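Proof: Because $\hat{\theta}^{(m+1)}$ maximizes $Q(\theta|\hat{\theta}^{(m)}, \mathbf{x})$, we have


$$Q(\hat{\theta}^{(m+1)}|\hat{\theta}^{(m)}, \mathbf{x}) \ge Q(\hat{\theta}^{(m)}|\hat{\theta}^{(m)}, \mathbf{x});$$


that is,

$$E_{\hat{\theta}^{(m)}}[\log L^c(\hat{\theta}^{(m+1)}|\mathbf{x}, \mathbf{Z})] \ge E_{\hat{\theta}^{(m)}}[\log L^c(\hat{\theta}^{(m)}|\mathbf{x}, \mathbf{Z})], \tag{6.6.8}$$


where the expectation is taken under the pdf $k(\mathbf{z}|\hat{\theta}^{(m)}, \mathbf{x})$. By expression (6.6.3),
we can complete the proof by showing that


$$E_{\hat{\theta}^{(m)}}[\log k(\mathbf{Z}|\hat{\theta}^{(m+1)}, \mathbf{x})] \le E_{\hat{\theta}^{(m)}}[\log k(\mathbf{Z}|\hat{\theta}^{(m)}, \mathbf{x})]. \tag{6.6.9}$$


Keep in mind that these expectations are taken under the conditional pdf of $\mathbf{Z}$ given
$\hat{\theta}^{(m)}$ and $\mathbf{x}$. An application of Jensen's inequality, (1.10.5), yields

$$\begin{aligned}
E_{\hat{\theta}^{(m)}}\left\{ \log\left[ \frac{k(\mathbf{Z}|\hat{\theta}^{(m+1)}, \mathbf{x})}{k(\mathbf{Z}|\hat{\theta}^{(m)}, \mathbf{x})} \right] \right\}
&\le \log E_{\hat{\theta}^{(m)}}\left[ \frac{k(\mathbf{Z}|\hat{\theta}^{(m+1)}, \mathbf{x})}{k(\mathbf{Z}|\hat{\theta}^{(m)}, \mathbf{x})} \right] \\
&= \log \int \frac{k(\mathbf{z}|\hat{\theta}^{(m+1)}, \mathbf{x})}{k(\mathbf{z}|\hat{\theta}^{(m)}, \mathbf{x})}\, k(\mathbf{z}|\hat{\theta}^{(m)}, \mathbf{x})\, d\mathbf{z} \\
&= \log(1) = 0.
\end{aligned} \tag{6.6.10}$$

This last result establishes (6.6.9) and, hence, finishes the proof. ■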


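As an example, suppose $X_1, X_2, \ldots, X_{n_1}$ are iid with pdf $f(x - \theta)$, for $-\infty <
x < \infty$, where $-\infty < \theta < \infty$. Denote the cdf of $X_i$ by $F(x - \theta)$. Let $Z_1, Z_2, \ldots, Z_{n_2}$
denote the censored observations. For these observations, we only know that $Z_i > a$,
for some $a$ that is known, and that the $Z_i$s are independent of the $X_i$s. Then the
observed and complete likelihoods are given by


$$L(\theta|\mathbf{x}) = [1 - F(a - \theta)]^{n_2} \prod_{i=1}^{n_1} f(x_i - \theta) \tag{6.6.11}$$


$$L^c(\theta|\mathbf{x}, \mathbf{z}) = \prod_{i=1}^{n_1} f(x_i - \theta) \prod_{i=1}^{n_2} f(z_i - \theta). \tag{6.6.12}$$

By expression (6.6.1), the conditional distribution of $\mathbf{Z}$ given $\mathbf{X}$ is the ratio of (6.6.12)
to (6.6.11); that is,


$$\begin{aligned}
k(\mathbf{z}|\theta, \mathbf{x}) &= \frac{\prod_{i=1}^{n_1} f(x_i - \theta) \prod_{i=1}^{n_2} f(z_i - \theta)}{[1 - F(a - \theta)]^{n_2} \prod_{i=1}^{n_1} f(x_i - \theta)} \\
&= [1 - F(a - \theta)]^{-n_2} \prod_{i=1}^{n_2} f(z_i - \theta), \quad a < z_i,\ i = 1, \ldots, n_2.
\end{aligned} \tag{6.6.13}$$

Thus, $\mathbf{Z}$ and $\mathbf{X}$ are independent, and $Z_1, \ldots, Z_{n_2}$ are iid with the common pdf
$f(z - \theta)/[1 - F(a - \theta)]$, for $z > a$. Based on these observations and expression
(6.6.13), we have the following derivation: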


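$$\begin{aligned}
Q(\theta|\theta_0, \mathbf{x}) &= E_{\theta_0}[\log L^c(\theta|\mathbf{x}, \mathbf{Z})] \\
&= E_{\theta_0}\left[ \sum_{i=1}^{n_1} \log f(x_i - \theta) + \sum_{i=1}^{n_2} \log f(Z_i - \theta) \right] \\
&= \sum_{i=1}^{n_1} \log f(x_i - \theta) + n_2 E_{\theta_0}[\log f(Z - \theta)] \\
&= \sum_{i=1}^{n_1} \log f(x_i - \theta) + n_2 \int_a^\infty \log f(z - \theta)\, \frac{f(z - \theta_0)}{1 - F(a - \theta_0)}\, dz.
\end{aligned} \tag{6.6.14}$$


This last result is the E step of the EM algorithm. For the M step, we need the
partial derivative of $Q(\theta|\theta_0, \mathbf{x})$ with respect to $\theta$. This is easily found to be


$$\frac{\partial Q}{\partial \theta} = -\left\{ \sum_{i=1}^{n_1} \frac{f'(x_i - \theta)}{f(x_i - \theta)} + n_2 \int_a^\infty \frac{f'(z - \theta)}{f(z - \theta)}\, \frac{f(z - \theta_0)}{1 - F(a - \theta_0)}\, dz \right\}. \tag{6.6.15}$$


Assuming that $\theta_0 = \hat{\theta}_0$, the first-step EM estimate would be the value of $\theta$, say $\hat{\theta}^{(1)}$,
which solves $\partial Q/\partial \theta = 0$. In the next example, we obtain the solution for a normal
model.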


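Example 6.6.1. Assume the censoring model given above, but now assume that
$X$ has a $N(\theta, 1)$ distribution. Then $f(x) = \phi(x) = (2\pi)^{-1/2} \exp\{-x^2/2\}$. It is easy
to show that $f'(x)/f(x) = -x$. Letting $\Phi(z)$ denote, as usual, the cdf of a standard
normal random variable, by (6.6.15) the partial derivative of $Q(\theta|\theta_0, \mathbf{x})$ with respect
to $\theta$ for this model simplifies to


$$\begin{aligned}
\frac{\partial Q}{\partial \theta} &= \sum_{i=1}^{n_1} (x_i - \theta) + n_2 \int_a^\infty (z - \theta)\, \frac{1}{\sqrt{2\pi}}\, \frac{\exp\{-(z - \theta_0)^2/2\}}{1 - \Phi(a - \theta_0)}\, dz \\
&= n_1(\bar{x} - \theta) + n_2 \int_a^\infty (z - \theta_0)\, \frac{1}{\sqrt{2\pi}}\, \frac{\exp\{-(z - \theta_0)^2/2\}}{1 - \Phi(a - \theta_0)}\, dz - n_2(\theta - \theta_0) \\
&= n_1(\bar{x} - \theta) + \frac{n_2}{1 - \Phi(a - \theta_0)}\, \phi(a - \theta_0) - n_2(\theta - \theta_0).
\end{aligned}$$

Solving $\partial Q/\partial \theta = 0$ for $\theta$ determines the EM step estimates. In particular, given
that $\hat{\theta}^{(m)}$ is the EM estimate on the $m$th step, the $(m + 1)$st step estimate is

$$\hat{\theta}^{(m+1)} = \frac{n_1}{n} \bar{x} + \frac{n_2}{n} \hat{\theta}^{(m)} + \frac{n_2}{n}\, \frac{\phi(a - \hat{\theta}^{(m)})}{1 - \Phi(a - \hat{\theta}^{(m)})}, \tag{6.6.16}$$

where $n = n_1 + n_2$.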
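
The recursion (6.6.16) is easy to program. The following is a minimal Python sketch (standard library only; the function names and the data values are hypothetical and are not the data of any exercise). It starts from the mean of the uncensored observations, iterates (6.6.16) until successive estimates agree to within a tolerance, and also evaluates the log of the observed likelihood (6.6.11) so that the monotonicity stated in Theorem 6.6.1 can be inspected.

```python
import math

def phi(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def observed_loglik(theta, x, n2, a):
    """Log of the observed likelihood (6.6.11) for the N(theta, 1) model."""
    return (n2 * math.log(1.0 - Phi(a - theta))
            + sum(math.log(phi(xi - theta)) for xi in x))

def em_censored_normal(x, n2, a, tol=1e-10, max_iter=1000):
    """Iterate (6.6.16); x holds the uncensored values, n2 values exceed a."""
    n1 = len(x)
    n = n1 + n2
    xbar = sum(x) / n1
    theta = xbar                      # initial estimate (an assumption)
    path = [theta]
    for _ in range(max_iter):
        ratio = phi(a - theta) / (1.0 - Phi(a - theta))
        new_theta = (n1 * xbar + n2 * theta + n2 * ratio) / n
        path.append(new_theta)
        if abs(new_theta - theta) < tol:
            break
        theta = new_theta
    return path

x = [0.3, -0.5, 1.1, 0.8, 0.1, 1.3, -0.2, 0.6]   # hypothetical uncensored values
path = em_censored_normal(x, n2=3, a=1.5)
print(len(path) - 1, path[-1])
print([round(observed_loglik(t, x, 3, 1.5), 6) for t in path[:5]])
```

At convergence the estimate satisfies $n_1(\bar{x} - \theta) + n_2 \phi(a - \theta)/[1 - \Phi(a - \theta)] = 0$, which is the likelihood equation of the observed likelihood, so the limit exceeds $\bar{x}$ whenever $n_2 > 0$.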


For our second example, consider a mixture problem involving normal distributions.
Suppose $Y_1$ has a $N(\mu_1, \sigma_1^2)$ distribution and $Y_2$ has a $N(\mu_2, \sigma_2^2)$ distribution.
Let $W$ be a Bernoulli random variable independent of $Y_1$ and $Y_2$ and with
probability of success $\epsilon = P(W = 1)$. Suppose the random variable we observe
is $X = (1 - W)Y_1 + WY_2$. In this case, the vector of parameters is given by
$\theta' = (\mu_1, \mu_2, \sigma_1, \sigma_2, \epsilon)$. As shown in Section 3.4, the pdf of the mixture random
variable $X$ is

$$f(x) = (1 - \epsilon) f_1(x) + \epsilon f_2(x), \quad -\infty < x < \infty, \tag{6.6.17}$$


where $f_j(x) = \sigma_j^{-1} \phi[(x - \mu_j)/\sigma_j]$, $j = 1, 2$, and $\phi(z)$ is the pdf of a standard normal
random variable. Suppose we observe a random sample $\mathbf{X}' = (X_1, X_2, \ldots, X_n)$ from
this mixture distribution with pdf $f(x)$. Then the log of the likelihood function is


$$l(\theta|\mathbf{x}) = \sum_{i=1}^n \log[(1 - \epsilon) f_1(x_i) + \epsilon f_2(x_i)]. \tag{6.6.18}$$



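In this mixture problem, the unobserved data are the random variables that
identify the distribution membership. For $i = 1, 2, \ldots, n$, define the random
variables

$$W_i = \begin{cases} 0 & \text{if } X_i \text{ has pdf } f_1(x) \\ 1 & \text{if } X_i \text{ has pdf } f_2(x). \end{cases}$$


These variables, of course, constitute the random sample on the Bernoulli random
variable $W$. Accordingly, assume that $W_1, W_2, \ldots, W_n$ are iid Bernoulli random
variables with probability of success $\epsilon$. The complete likelihood function is


$$L^c(\theta|\mathbf{x}, \mathbf{w}) = \prod_{W_i = 0} f_1(x_i) \prod_{W_i = 1} f_2(x_i).$$


Hence the log of the complete likelihood function is


$$\begin{aligned}
l^c(\theta|\mathbf{x}, \mathbf{w}) &= \sum_{W_i = 0} \log f_1(x_i) + \sum_{W_i = 1} \log f_2(x_i) \\
&= \sum_{i=1}^n [(1 - w_i) \log f_1(x_i) + w_i \log f_2(x_i)].
\end{aligned} \tag{6.6.19}$$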


For the E step of the algorithm, we need the conditional expectation of $W_i$ given $\mathbf{x}$
under $\theta_0$; that is,

$$E_{\theta_0}[W_i|\theta_0, \mathbf{x}] = P[W_i = 1|\theta_0, \mathbf{x}].$$

An estimate of this expectation is the likelihood of $x_i$ being drawn from distribution
$f_2(x)$, which is given by


$$\gamma_i = \frac{\hat{\epsilon} f_{2,0}(x_i)}{(1 - \hat{\epsilon}) f_{1,0}(x_i) + \hat{\epsilon} f_{2,0}(x_i)}, \tag{6.6.20}$$


where the subscript 0 signifies that the parameters at $\theta_0$ are being used. Expression
(6.6.20) is intuitively evident; see McLachlan and Krishnan (1997) for more discussion.
Replacing $w_i$ by $\gamma_i$ in expression (6.6.19), the M step of the algorithm is to
maximize


$$Q(\theta|\theta_0, \mathbf{x}) = \sum_{i=1}^n [(1 - \gamma_i) \log f_1(x_i) + \gamma_i \log f_2(x_i)]. \tag{6.6.21}$$

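This maximization is easy to obtain by taking partial derivatives of $Q(\theta|\theta_0, \mathbf{x})$ with
respect to the parameters. For example,


$$\frac{\partial Q}{\partial \mu_1} = \sum_{i=1}^n (1 - \gamma_i)(-1/2\sigma_1^2)(-2)(x_i - \mu_1).$$


Setting this to 0 and solving for $\mu_1$ yields the estimate of $\mu_1$. The estimates of the
other mean and the variances can be obtained similarly. These estimates are


$$\begin{aligned}
\hat{\mu}_1 &= \frac{\sum_{i=1}^n (1 - \gamma_i) x_i}{\sum_{i=1}^n (1 - \gamma_i)} \\
\hat{\sigma}_1^2 &= \frac{\sum_{i=1}^n (1 - \gamma_i)(x_i - \hat{\mu}_1)^2}{\sum_{i=1}^n (1 - \gamma_i)} \\
\hat{\mu}_2 &= \frac{\sum_{i=1}^n \gamma_i x_i}{\sum_{i=1}^n \gamma_i} \\
\hat{\sigma}_2^2 &= \frac{\sum_{i=1}^n \gamma_i (x_i - \hat{\mu}_2)^2}{\sum_{i=1}^n \gamma_i}.
\end{aligned}$$

Since $\gamma_i$ is an estimate of $P[W_i = 1|\theta_0, \mathbf{x}]$, the average $n^{-1} \sum_{i=1}^n \gamma_i$ is an estimate
of $\epsilon = P[W_i = 1]$. This average is our estimate $\hat{\epsilon}$.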
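
The E and M steps for the normal mixture can be sketched as follows. This is a minimal Python illustration (standard library only), run on simulated data with hypothetical parameter values; the function names and the crude starting values are assumptions made for the sketch, not part of the text. Each pass computes the weights $\gamma_i$ of (6.6.20) at the current parameter values and then the weighted means, standard deviations, and the average of the $\gamma_i$. The observed log likelihood (6.6.18) is tracked to inspect the behavior described in Theorem 6.6.1.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density sigma^{-1} * phi((x - mu) / sigma)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_loglik(x, mu1, mu2, s1, s2, eps):
    """Observed log likelihood (6.6.18)."""
    return sum(math.log((1 - eps) * normal_pdf(xi, mu1, s1)
                        + eps * normal_pdf(xi, mu2, s2)) for xi in x)

def em_mixture_step(x, mu1, mu2, s1, s2, eps):
    """One EM pass: weights (6.6.20), then the weighted estimates."""
    gam = []
    for xi in x:
        f1 = (1 - eps) * normal_pdf(xi, mu1, s1)
        f2 = eps * normal_pdf(xi, mu2, s2)
        gam.append(f2 / (f1 + f2))
    w2 = sum(gam)                 # total weight on component 2
    w1 = len(x) - w2              # total weight on component 1
    new_mu1 = sum((1 - g) * xi for g, xi in zip(gam, x)) / w1
    new_mu2 = sum(g * xi for g, xi in zip(gam, x)) / w2
    new_s1 = math.sqrt(sum((1 - g) * (xi - new_mu1) ** 2
                           for g, xi in zip(gam, x)) / w1)
    new_s2 = math.sqrt(sum(g * (xi - new_mu2) ** 2
                           for g, xi in zip(gam, x)) / w2)
    return new_mu1, new_mu2, new_s1, new_s2, w2 / len(x)

# Simulated sample: N(0, 1) with probability 0.6, N(4, 1.5^2) with probability 0.4.
rng = random.Random(2)
x = [rng.gauss(4.0, 1.5) if rng.random() < 0.4 else rng.gauss(0.0, 1.0)
     for _ in range(300)]

# Crude starting values: extreme observations as means, pooled sd as both scales.
xbar = sum(x) / len(x)
sd = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / len(x))
params = (min(x), max(x), sd, sd, 0.5)
logliks = [mixture_loglik(x, *params)]
for _ in range(50):
    params = em_mixture_step(x, *params)
    logliks.append(mixture_loglik(x, *params))
print(params)
print(logliks[0], logliks[-1])
```

The limit depends on the starting values, and mixture likelihoods can have several local maxima, so in practice it is sensible to try more than one set of starting values.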


EXERCISES 


6.6.1. Rao (page 368, 1973) considers a problem in the estimation of linkages in
genetics. McLachlan and Krishnan (1997) also discuss this problem and we present
their model. For our purposes, it can be described as a multinomial model with the
four categories $C_1$, $C_2$, $C_3$, and $C_4$. For a sample of size $n$, let $\mathbf{X} = (X_1, X_2, X_3, X_4)'$
denote the observed frequencies of the four categories. Hence, $n = \sum_{i=1}^4 X_i$. The
probability model is

$C_1$ | $C_2$ | $C_3$ | $C_4$
$\frac{1}{2} + \frac{1}{4}\theta$ | $\frac{1}{4} - \frac{1}{4}\theta$ | $\frac{1}{4} - \frac{1}{4}\theta$ | $\frac{1}{4}\theta$

where the parameter $\theta$ satisfies $0 \le \theta \le 1$. In this exercise, we obtain the mle of $\theta$.


(a) Show that the likelihood function is given by


$$L(\theta|\mathbf{x}) = \frac{n!}{x_1! x_2! x_3! x_4!} \left[ \frac{1}{2} + \frac{1}{4}\theta \right]^{x_1} \left[ \frac{1}{4} - \frac{1}{4}\theta \right]^{x_2 + x_3} \left[ \frac{1}{4}\theta \right]^{x_4}. \tag{6.6.22}$$


(b) Show that the log of the likelihood function can be expressed as a constant
(not involving parameters) plus the term


$$x_1 \log[2 + \theta] + [x_2 + x_3] \log[1 - \theta] + x_4 \log \theta.$$


(c) Obtain the partial derivative with respect to $\theta$ of the last expression, set the
result to 0, and solve for the mle. (This will result in a quadratic equation
that has one positive and one negative root.)


6.6.2. In this exercise, we set up an EM algorithm to determine the mle for the
situation described in Exercise 6.6.1. Split category $C_1$ into the two subcategories
$C_{11}$ and $C_{12}$ with probabilities $1/2$ and $\theta/4$, respectively. Let $Z_{11}$ and $Z_{12}$ denote
the respective “frequencies.” Then $X_1 = Z_{11} + Z_{12}$. Of course, we cannot observe
$Z_{11}$ and $Z_{12}$. Let $\mathbf{Z} = (Z_{11}, Z_{12})'$.


(a) Obtain the complete likelihood $L^c(\theta|\mathbf{x}, \mathbf{z})$.


(b) Using the last result and (6.6.22), show that the conditional pmf $k(\mathbf{z}|\theta, \mathbf{x})$ is
binomial with parameters $x_1$ and probability of success $\theta/(2 + \theta)$.


(c) Obtain the E step of the EM algorithm given an initial estimate $\hat{\theta}^{(0)}$ of $\theta$.
That is, obtain


$$Q(\theta|\hat{\theta}^{(0)}, \mathbf{x}) = E_{\hat{\theta}^{(0)}}[\log L^c(\theta|\mathbf{x}, \mathbf{Z})|\hat{\theta}^{(0)}, \mathbf{x}].$$


Recall that this expectation is taken using the conditional pmf $k(\mathbf{z}|\hat{\theta}^{(0)}, \mathbf{x})$.
Keep in mind the next step; i.e., we need only terms that involve $\theta$.


(d) For the M step of the EM algorithm, solve the equation $\partial Q(\theta|\hat{\theta}^{(0)}, \mathbf{x})/\partial \theta = 0$.
Show that the solution is


$$\hat{\theta}^{(1)} = \frac{x_1 \hat{\theta}^{(0)} + 2x_4 + x_4 \hat{\theta}^{(0)}}{n \hat{\theta}^{(0)} + 2(x_2 + x_3 + x_4)}. \tag{6.6.23}$$

6.6.3. For the setup of Exercise 6.6.2, show that the following estimator of $\theta$ is
unbiased:


$$\tilde{\theta} = n^{-1}(X_1 - X_2 - X_3 + X_4). \tag{6.6.24}$$


6.6.4. Rao (page 368, 1973) presents data for the situation described in Exercise
6.6.1. The observed frequencies are $\mathbf{x} = (125, 18, 20, 34)'$.


(a) Using computational packages (for example, R), with (6.6.24) as the initial
estimate, write a program that obtains the stepwise EM estimates $\hat\theta^{(k)}$.


(b) Using the data from Rao, compute the EM estimate of $\theta$ with your program.
List the sequence of EM estimates, $\{\hat\theta^{(k)}\}$, that you obtained. Did your sequence
of estimates converge?


(c) Show that the mle using the likelihood approach in Exercise 6.6.1 is the positive
root of the equation $197\theta^2 - 15\theta - 68 = 0$. Compare it with your EM
solution. They should be the same within roundoff error.


6.6.5. Suppose $X_1, X_2, \ldots, X_{n_1}$ is a random sample from a $N(\theta, 1)$ distribution.
Besides these $n_1$ observable items, suppose there are $n_2$ missing items, which we
denote by $Z_1, Z_2, \ldots, Z_{n_2}$. Show that the first-step EM estimate is


$$\hat\theta^{(1)} = \frac{n_1\bar{x} + n_2\hat\theta^{(0)}}{n},$$


where $\hat\theta^{(0)}$ is an initial estimate of $\theta$ and $n = n_1 + n_2$. Note that if $\hat\theta^{(0)} = \bar{x}$, then
$\hat\theta^{(k)} = \bar{x}$ for all $k$.


6.6.6. Consider the situation described in Example 6.6.1. But suppose we have left
censoring. That is, if $Z_1, Z_2, \ldots, Z_{n_2}$ are the censored items, then all we know is
that each $Z_i < a$. Obtain the EM algorithm estimate of $\theta$.


6.6.7. Suppose these data follow the model of Example 6.6.1:


2.01   0.74   0.68   1.50*  1.47   1.50*  1.50*  1.52
0.07  -0.04  -0.21   0.05  -0.09   0.67   0.14


where the superscript * denotes that the observation was censored at 1.50. Write
a computer program to obtain the EM algorithm estimate of $\theta$.


6.6.8. The following data are observations of the random variable $X = (1-W)Y_1 + WY_2$,
where $W$ has a Bernoulli distribution with probability of success 0.70; $Y_1$
has a $N(100, 20^2)$ distribution; $Y_2$ has a $N(120, 25^2)$ distribution; $W$ and $Y_1$ are
independent; and $W$ and $Y_2$ are independent. Data are in the file mix668.rda.


119.0 96.0 146.2 138.6 143.4 98.2 124.5 
114.1 136.2 136.4 184.8 79.8 151.9 114.2
145.7 95.9 97.3 136.4 109.2 103.2 


Program the EM algorithm for this mixing problem as discussed at the end of the
section. Use a dotplot to obtain initial estimates of the parameters. Compute the
estimates. How close are they to the true parameters? Note: assuming the R vector x
contains the sample on X, a quick dotplot in R is computed by plot(rep(1, 20) ~ x).
Chapter 7


Sufficiency 


7.1 Measures of Quality of Estimators 


In Chapters 4 and 6 we presented procedures for finding point estimates, interval 
estimates, and tests of statistical hypotheses based on likelihood theory. In this 
and the next chapter, we present some optimal point estimates and tests for certain 
situations. We first consider point estimation. 

In this chapter, as in Chapters 4 and 6, we find it convenient to use the letter 
f to denote a pmf as well as a pdf. It is clear from the context whether we are 
discussing the distributions of discrete or continuous random variables. 

Suppose $f(x;\theta)$ for $\theta \in \Omega$ is the pdf (pmf) of a continuous (discrete) random
variable $X$. Consider a point estimator $Y_n = u(X_1,\ldots,X_n)$ based on a sample
$X_1,\ldots,X_n$. In Chapters 4 and 5, we discussed several properties of point estimators.
Recall that $Y_n$ is a consistent estimator (Definition 5.1.2) of $\theta$ if $Y_n$ converges to
$\theta$ in probability; i.e., $Y_n$ is close to $\theta$ for large sample sizes. This is definitely a
desirable property of a point estimator. Under suitable conditions, Theorem 6.1.3
shows that the maximum likelihood estimator is consistent. Another property was
unbiasedness (Definition 4.1.3), which says that $Y_n$ is an unbiased estimator of $\theta$
if $E(Y_n) = \theta$. Recall that maximum likelihood estimators may not be unbiased,
although generally they are asymptotically unbiased (see Theorem 6.2.2).

If two estimators of $\theta$ are unbiased, it would seem that we would choose the one
with the smaller variance. This would be especially true if they were both approximately
normal because the one with the smaller asymptotic variance (and hence
asymptotic standard error) would tend to produce shorter asymptotic confidence
intervals for $\theta$. This leads to the following definition:


Definition 7.1.1. For a given positive integer $n$, $Y = u(X_1, X_2,\ldots,X_n)$ is called
a minimum variance unbiased estimator (MVUE) of the parameter $\theta$ if $Y$ is
unbiased, that is, $E(Y) = \theta$, and if the variance of $Y$ is less than or equal to the
variance of every other unbiased estimator of $\theta$.


Example 7.1.1. As an illustration, let $X_1, X_2,\ldots,X_9$ denote a random sample
from a distribution that is $N(\theta,\sigma^2)$, where $-\infty < \theta < \infty$. Because the statistic
$\bar{X} = (X_1+X_2+\cdots+X_9)/9$ is $N(\theta, \sigma^2/9)$, $\bar{X}$ is an unbiased estimator of $\theta$. The statistic
$X_1$ is $N(\theta,\sigma^2)$, so $X_1$ is also an unbiased estimator of $\theta$. Although the variance
$\sigma^2/9$ of $\bar{X}$ is less than the variance $\sigma^2$ of $X_1$, we cannot say, with $n = 9$, that $\bar{X}$ is
the minimum variance unbiased estimator (MVUE) of $\theta$; that definition requires
that the comparison be made with every unbiased estimator of $\theta$. To be sure, it is
quite impossible to tabulate all other unbiased estimators of this parameter $\theta$, so
other methods must be developed for making the comparisons of the variances. A
beginning on this problem is made in this chapter. ■


Let us now discuss the problem of point estimation of a parameter from a slightly
different standpoint. Let $X_1, X_2,\ldots,X_n$ denote a random sample of size $n$ from a
distribution that has the pdf $f(x;\theta)$, $\theta \in \Omega$. The distribution may be of either the
continuous or the discrete type. Let $Y = u(X_1, X_2,\ldots, X_n)$ be a statistic on which
we wish to base a point estimate of the parameter $\theta$. Let $\delta(y)$ be that function of
the observed value of the statistic $Y$ which is the point estimate of $\theta$. Thus the
function $\delta$ decides the value of our point estimate of $\theta$ and $\delta$ is called a decision
function or a decision rule. One value of the decision function, say $\delta(y)$, is called
a decision. Thus a numerically determined point estimate of a parameter $\theta$ is a
decision. Now a decision may be correct or it may be wrong. It would be useful to
have a measure of the seriousness of the difference, if any, between the true value
of $\theta$ and the point estimate $\delta(y)$. Accordingly, with each pair, $[\theta,\delta(y)]$, $\theta \in \Omega$, we
associate a nonnegative number $\mathcal{L}[\theta, \delta(y)]$ that reflects this seriousness. We call the
function $\mathcal{L}$ the loss function. The expected (mean) value of the loss function is
called the risk function. If $f_Y(y;\theta)$, $\theta \in \Omega$, is the pdf of $Y$, the risk function
$R(\theta,\delta)$ is given by


$$R(\theta,\delta) = E\{\mathcal{L}[\theta,\delta(Y)]\} = \int_{-\infty}^{\infty} \mathcal{L}[\theta,\delta(y)]\,f_Y(y;\theta)\,dy$$


if $Y$ is a random variable of the continuous type. It would be desirable to select a
decision function that minimizes the risk $R(\theta,\delta)$ for all values of $\theta$, $\theta \in \Omega$. But this
is usually impossible because the decision function $\delta$ that minimizes $R(\theta,\delta)$ for one
value of $\theta$ may not minimize $R(\theta,\delta)$ for another value of $\theta$. Accordingly, we need
either to restrict our decision function to a certain class or to consider methods of
ordering the risk functions. The following example, while very simple, dramatizes
these difficulties.


Example 7.1.2. Let $X_1, X_2,\ldots, X_{25}$ be a random sample from a distribution that
is $N(\theta,1)$, for $-\infty < \theta < \infty$. Let $Y = \bar{X}$, the mean of the random sample, and
let $\mathcal{L}[\theta, \delta(y)] = [\theta - \delta(y)]^2$. We shall compare the two decision functions given by
$\delta_1(y) = y$ and $\delta_2(y) = 0$ for $-\infty < y < \infty$. The corresponding risk functions are

$$R(\theta,\delta_1) = E[(\theta-Y)^2] = \frac{1}{25}$$

and

$$R(\theta, \delta_2) = E[(\theta - 0)^2] = \theta^2.$$

If, in fact, $\theta = 0$, then $\delta_2(y) = 0$ is an excellent decision and we have $R(0, \delta_2) = 0$.
However, if $\theta$ differs from zero by very much, it is equally clear that $\delta_2 = 0$ is a poor
decision. For example, if, in fact, $\theta = 2$, $R(2,\delta_2) = 4 > R(2,\delta_1) = \frac{1}{25}$. In general,
we see that $R(\theta,\delta_2) < R(\theta,\delta_1)$, provided that $-\frac{1}{5} < \theta < \frac{1}{5}$, and that otherwise
$R(\theta,\delta_2) \ge R(\theta, \delta_1)$. That is, one of these decision functions is better than the other
for some values of $\theta$, while the other decision function is better for other values of
$\theta$. If, however, we had restricted our consideration to decision functions $\delta$ such that
$E[\delta(Y)] = \theta$ for all values of $\theta$, $\theta \in \Omega$, then the decision function $\delta_2(y) = 0$ is not
allowed. Under this restriction and with the given $\mathcal{L}[\theta, \delta(y)]$, the risk function is the
variance of the unbiased estimator $\delta(Y)$, and we are confronted with the problem of
finding the MVUE. Later in this chapter we show that the solution is $\delta(y) = y = \bar{x}$.

Suppose, however, that we do not want to restrict ourselves to decision functions
$\delta$, such that $E[\delta(Y)] = \theta$ for all values of $\theta$, $\theta \in \Omega$. Instead, let us say that
the decision function that minimizes the maximum of the risk function is the best
decision function. Because, in this example, $R(\theta,\delta_2) = \theta^2$ is unbounded, $\delta_2(y) = 0$
is not, in accordance with this criterion, a good decision function. On the other
hand, with $-\infty < \theta < \infty$, we have

$$\max_\theta R(\theta, \delta_1) = \max_\theta\left(\frac{1}{25}\right) = \frac{1}{25}.$$

Accordingly, $\delta_1(y) = y = \bar{x}$ seems to be a very good decision in accordance with
this criterion because $\frac{1}{25}$ is small. As a matter of fact, it can be proved that $\delta_1$ is
the best decision function, as measured by the minimax criterion, when the loss
function is $\mathcal{L}[\theta, \delta(y)] = [\theta - \delta(y)]^2$. ■
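
The comparison in Example 7.1.2 can be checked numerically. The following R sketch is our own illustration, not part of the original example: it evaluates the two risk functions at a few values of the parameter and confirms the constant risk 1/25 of the first decision function by simulation at the parameter value 2. The grid of parameter values, the seed, the number of replications, and all object names are assumptions made for the illustration.

```r
# Risk functions from Example 7.1.2 under squared-error loss.
# Y is the mean of n = 25 observations from N(theta, 1), so Var(Y) = 1/25.
n <- 25
risk_delta1 <- function(theta) rep(1 / n, length(theta))  # delta1(y) = y
risk_delta2 <- function(theta) theta^2                    # delta2(y) = 0

# Illustrative parameter values (an assumption, chosen away from +-1/5).
theta_grid <- c(-2, -0.5, -0.1, 0, 0.1, 0.5, 2)
risk_table <- data.frame(theta = theta_grid,
                         R1 = risk_delta1(theta_grid),
                         R2 = risk_delta2(theta_grid))

# delta2 has the smaller risk only when |theta| < 1/5.
delta2_better <- risk_table$theta[risk_table$R2 < risk_table$R1]

# Monte Carlo check of R(theta, delta1) at theta = 2 (seed is arbitrary).
set.seed(1)
ybar <- replicate(10000, mean(rnorm(n, mean = 2, sd = 1)))
mc_risk_delta1 <- mean((2 - ybar)^2)
```

Printing risk_table shows the crossover of the two risks, which is why neither decision function dominates the other uniformly in the parameter.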


In this example we illustrated the following: 


1. Without some restriction on the decision function, it is difficult to find a 
decision function that has a risk function which is uniformly less than the risk 
function of another decision. 


2. One principle of selecting a best decision function is called the minimax
principle. This principle may be stated as follows: If the decision function
given by $\delta_0(y)$ is such that, for all $\theta \in \Omega$,


$$\max_\theta R[\theta, \delta_0(y)] \le \max_\theta R[\theta, \delta(y)]$$


for every other decision function $\delta(y)$, then $\delta_0(y)$ is called a minimax decision
function.


With the restriction $E[\delta(Y)] = \theta$ and the loss function $\mathcal{L}[\theta,\delta(y)] = [\theta - \delta(y)]^2$,
the decision function that minimizes the risk function yields an unbiased estimator
with minimum variance. If, however, the restriction $E[\delta(Y)] = \theta$ is replaced by some
other condition, the decision function $\delta(Y)$, if it exists, which minimizes
$E\{[\theta - \delta(Y)]^2\}$ uniformly in $\theta$ is sometimes called the minimum mean-squared-error
estimator. Exercises 7.1.6–7.1.8 provide examples of this type of estimator.

There are two additional observations about decision rules and loss functions
that should be made at this point. First, since $Y$ is a statistic, the decision rule
$\delta(Y)$ is also a statistic, and we could have started directly with a decision rule based
on the observations in a random sample, say, $\delta_1(X_1, X_2,\ldots,X_n)$. The risk function
is then given by


$$R(\theta, \delta_1) = E\{\mathcal{L}[\theta, \delta_1(X_1,\ldots,X_n)]\}
= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} \mathcal{L}[\theta, \delta_1(x_1,\ldots,x_n)]\,f(x_1;\theta)\cdots f(x_n;\theta)\,dx_1\cdots dx_n$$


if the random sample arises from a continuous-type distribution. We do not do this,
because, as we show in this chapter, it is rather easy to find a good statistic, say
$Y$, upon which to base all of the statistical inferences associated with a particular
model. Thus we thought it more appropriate to start with a statistic that would
be familiar, like the mle $Y = \bar{X}$ in Example 7.1.2. The second decision rule of that
example could be written $\delta_2(X_1, X_2,\ldots, X_n) = 0$, a constant no matter what values
of $X_1, X_2,\ldots,X_n$ are observed.

The second observation is that we have only used one loss function, namely,
the squared-error loss function $\mathcal{L}(\theta,\delta) = (\theta - \delta)^2$. The absolute-error loss
function $\mathcal{L}(\theta,\delta) = |\theta - \delta|$ is another popular one. The loss function defined by


$$\mathcal{L}(\theta, \delta) = \begin{cases} 0, & |\theta - \delta| \le a, \\ b, & |\theta - \delta| > a, \end{cases}$$


where $a$ and $b$ are positive constants, is sometimes referred to as the goalpost loss
function. The reason for this terminology is that football fans recognize that it
is similar to kicking a field goal: There is no loss (actually a three-point gain) if
within $a$ units of the middle but $b$ units of loss (zero points awarded) if outside that
restriction. In addition, loss functions can be asymmetric as well as symmetric, as
the three previous ones have been. That is, for example, it might be more costly to
underestimate the value of $\theta$ than to overestimate it. (Many of us think about this
type of loss function when estimating the time it takes us to reach an airport to
catch a plane.) Some of these loss functions are considered when studying Bayesian
estimates in Chapter 11.
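
For readers who wish to experiment, the loss functions just described are easy to code. The R sketch below is our own illustration: the function names, the constants a = 0.5 and b = 1, and the estimates are assumptions, and the asymmetric loss shown is only one hypothetical example of a loss that penalizes underestimation more heavily, not a form taken from the text.

```r
# Three symmetric loss functions; theta is the true value, d the estimate.
squared_error_loss  <- function(theta, d) (theta - d)^2
absolute_error_loss <- function(theta, d) abs(theta - d)
# Goalpost loss: no loss within a units of theta, a flat loss b otherwise.
goalpost_loss <- function(theta, d, a, b) ifelse(abs(theta - d) <= a, 0, b)

# A hypothetical asymmetric loss (not from the text): underestimating theta
# costs k_under times as much as overestimating it by the same amount.
asymmetric_loss <- function(theta, d, k_under = 3) {
  ifelse(d < theta, k_under * (theta - d), d - theta)
}

estimates <- c(4, 4.75, 5, 5.25, 6)   # illustrative estimates of theta = 5
goalpost_values   <- goalpost_loss(5, estimates, a = 0.5, b = 1)
asymmetric_values <- asymmetric_loss(5, estimates)
```

Averaging any of these functions over simulated values of an estimator gives a Monte Carlo approximation to the corresponding risk.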

Let us close this section with an interesting illustration that raises a question
leading to the likelihood principle, which many statisticians believe is a quality
characteristic that estimators should enjoy. Suppose that two statisticians, A and
B, observe 10 independent trials of a random experiment ending in success or failure.
Let the probability of success on each trial be $\theta$, where $0 < \theta < 1$. Let us say that
each statistician observes one success in these 10 trials. Suppose, however, that
A had decided to take $n = 10$ such observations in advance and found only one
success, while B had decided to take as many observations as needed to get the first
success, which happened on the 10th trial. The model of A is that $Y$ is $b(n = 10,\theta)$
and $y = 1$ is observed. On the other hand, B is considering the random variable $Z$
that has a geometric pmf $g(z) = (1-\theta)^{z-1}\theta$, $z = 1,2,3,\ldots$, and $z = 10$ is observed.
In either case, an estimate of $\theta$ could be the relative frequency of success given by


$$\frac{y}{n} = \frac{1}{z} = \frac{1}{10}.$$


Let us observe, however, that one of the corresponding estimators, $Y/n$ and $1/Z$,
is biased. We have


$$E\left(\frac{Y}{10}\right) = \frac{1}{10}E(Y) = \frac{1}{10}(10\theta) = \theta,$$


while


$$E\left(\frac{1}{Z}\right) = \sum_{z=1}^{\infty}\frac{1}{z}(1-\theta)^{z-1}\theta
= \theta + \frac{1}{2}(1-\theta)\theta + \frac{1}{3}(1-\theta)^2\theta + \cdots > \theta.$$


That is, 1/Z is a biased estimator while Y/10 is unbiased. Thus A is using an 
unbiased estimator while B is not. Should we adjust B’s estimator so that it, too, 
is unbiased? 
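
The size of the bias of 1/Z is easy to examine numerically. The R sketch below is our own illustration: it sums the series for E(1/Z) with the geometric pmf and compares the result with the parameter value. The truncation point of the series, the value 0.1 for the parameter, and the object names are assumptions; the closed form used as a check, $-\theta\log\theta/(1-\theta)$, follows from the power series of the logarithm and is not stated in the text.

```r
# Expected value of 1/Z when Z is geometric on 1, 2, 3, ... with success
# probability theta. R's dgeom() counts failures before the first success,
# so P(Z = z) = dgeom(z - 1, theta).
expected_inverse_z <- function(theta, z_max = 5000) {
  z <- 1:z_max                      # truncation point is an assumption
  sum((1 / z) * dgeom(z - 1, prob = theta))
}

theta <- 0.1
e_inv_z <- expected_inverse_z(theta)                                    # E(1/Z)
e_y_over_n <- sum((0:10 / 10) * dbinom(0:10, size = 10, prob = theta))  # E(Y/10)

# Closed form of the series: -theta * log(theta) / (1 - theta)
closed_form <- -theta * log(theta) / (1 - theta)
c(E_inverse_Z = e_inv_z, closed_form = closed_form, E_Y_over_n = e_y_over_n)
```

With this parameter value the expectation of 1/Z is well above the parameter, while the expectation of Y/10 reproduces it, in line with the statement that only A's estimator is unbiased.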

It is interesting to note that if we maximize the two respective likelihood functions,
namely,


$$L_1(\theta) = \binom{10}{y}\theta^y(1 - \theta)^{10-y}$$

and

$$L_2(\theta) = (1-\theta)^{z-1}\theta,$$

with $n = 10$, $y = 1$, and $z = 10$, we get exactly the same answer, $\hat\theta = \frac{1}{10}$. This
must be the case, because in each situation we are maximizing $(1 - \theta)^9\theta$. Many
statisticians believe that this is the way it should be and accordingly adopt the
likelihood principle:

Suppose two different sets of data from possibly two different random experiments
lead to respective likelihood functions, $L_1(\theta)$ and $L_2(\theta)$, that are proportional to each
other. These two data sets provide the same information about the parameter $\theta$ and
a statistician should obtain the same estimate of $\theta$ from either.

In our special illustration, we note that $L_1(\theta) \propto L_2(\theta)$, and the likelihood principle
states that statisticians A and B should make the same inference. Thus believers
in the likelihood principle would not adjust the second estimator to make it
unbiased.
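
The proportionality of the two likelihoods can be verified directly in R. The following sketch is our own illustration (the function names and the grid of parameter values are assumptions): it maximizes each likelihood numerically with optimize and computes the ratio of the two likelihoods at several parameter values.

```r
# Likelihoods for statistician A (binomial) and B (geometric).
lik_A <- function(theta) dbinom(1, size = 10, prob = theta)  # y = 1, n = 10
lik_B <- function(theta) dgeom(9, prob = theta)              # z = 10: 9 failures

mle_A <- optimize(lik_A, interval = c(0, 1), maximum = TRUE)$maximum
mle_B <- optimize(lik_B, interval = c(0, 1), maximum = TRUE)$maximum

# The ratio of the two likelihoods is the constant choose(10, 1) = 10.
theta_grid <- c(0.05, 0.1, 0.3, 0.7)
lik_ratio <- lik_A(theta_grid) / lik_B(theta_grid)
```

Both numerical maximizers agree with 1/10 up to the tolerance of optimize, and the ratio does not change with the parameter.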


EXERCISES 


7.1.1. Show that the mean $\bar{X}$ of a random sample of size $n$ from a distribution
having pdf $f(x;\theta) = (1/\theta)e^{-x/\theta}$, $0 < x < \infty$, $0 < \theta < \infty$, zero elsewhere, is an
unbiased estimator of $\theta$ and has variance $\theta^2/n$.


7.1.2. Let $X_1, X_2,\ldots,X_n$ denote a random sample from a normal distribution
with mean zero and variance $\theta$, $0 < \theta < \infty$. Show that $\sum_{i=1}^{n} X_i^2/n$ is an unbiased
estimator of $\theta$ and has variance $2\theta^2/n$.


7.1.3. Let $Y_1 < Y_2 < Y_3$ be the order statistics of a random sample of size 3 from
the uniform distribution having pdf $f(x;\theta) = 1/\theta$, $0 < x < \theta$, $0 < \theta < \infty$, zero
elsewhere. Show that $4Y_1$, $2Y_2$, and $\frac{4}{3}Y_3$ are all unbiased estimators of $\theta$. Find the
variance of each of these unbiased estimators.


7.1.4. Let $Y_1$ and $Y_2$ be two independent unbiased estimators of $\theta$. Assume that
the variance of $Y_1$ is twice the variance of $Y_2$. Find the constants $k_1$ and $k_2$ so that
$k_1Y_1 + k_2Y_2$ is an unbiased estimator with the smallest possible variance for such a
linear combination.


7.1.5. In Example 7.1.2 of this section, take $\mathcal{L}[\theta,\delta(y)] = |\theta - \delta(y)|$. Show that
$R(\theta,\delta_1) = \frac{1}{5}\sqrt{2/\pi}$ and $R(\theta,\delta_2) = |\theta|$. Of these two decision functions $\delta_1$ and $\delta_2$,
which yields the smaller maximum risk?


7.1.6. Let $X_1, X_2,\ldots,X_n$ denote a random sample from a Poisson distribution with
parameter $\theta$, $0 < \theta < \infty$. Let $Y = \sum_{i=1}^{n} X_i$ and let $\mathcal{L}[\theta,\delta(y)] = [\theta - \delta(y)]^2$. If we
restrict our considerations to decision functions of the form $\delta(y) = b + y/n$, where $b$
does not depend on $y$, show that $R(\theta,\delta) = b^2 + \theta/n$. What decision function of this
form yields a uniformly smaller risk than every other decision function of this form?
With this solution, say $\delta$, and $0 < \theta < \infty$, determine $\max_\theta R(\theta, \delta)$ if it exists.


7.1.7. Let $X_1, X_2,\ldots,X_n$ denote a random sample from a distribution that is
$N(\mu,\theta)$, $0 < \theta < \infty$, where $\mu$ is unknown. Let $Y = \sum_{i=1}^{n} (X_i - \bar{X})^2/n$ and let
$\mathcal{L}[\theta, \delta(y)] = [\theta - \delta(y)]^2$. If we consider decision functions of the form $\delta(y) = by$, where
$b$ does not depend upon $y$, show that $R(\theta,\delta) = (\theta^2/n^2)[(n^2 - 1)b^2 - 2n(n-1)b + n^2]$.
Show that $b = n/(n+1)$ yields a minimum risk decision function of this form. Note
that $nY/(n + 1)$ is not an unbiased estimator of $\theta$. With $\delta(y) = ny/(n + 1)$ and
$0 < \theta < \infty$, determine $\max_\theta R(\theta, \delta)$ if it exists.


7.1.8. Let $X_1, X_2,\ldots,X_n$ denote a random sample from a distribution that is
$b(1,\theta)$, $0 < \theta < 1$. Let $Y = \sum_{i=1}^{n} X_i$ and let $\mathcal{L}[\theta,\delta(y)] = [\theta - \delta(y)]^2$. Consider
decision functions of the form $\delta(y) = by$, where $b$ does not depend upon $y$. Prove
that $R(\theta,\delta) = b^2n\theta(1 - \theta) + (bn - 1)^2\theta^2$. Show that


$$\max_\theta R(\theta,\delta) = \frac{b^4n^2}{4[b^2n - (bn-1)^2]},$$


provided that the value $b$ is such that $b^2n > (bn - 1)^2$. Prove that $b = 1/n$ does not
minimize $\max_\theta R(\theta, \delta)$.


7.1.9. Let $X_1, X_2,\ldots,X_n$ be a random sample from a Poisson distribution with
mean $\theta > 0$.

(a) Statistician A observes the sample to be the values $x_1, x_2,\ldots,x_n$ with sum
$y = \sum x_i$. Find the mle of $\theta$.


(b) Statistician B loses the sample values $x_1, x_2,\ldots,x_n$ but remembers the sum
$y_1$ and the fact that the sample arose from a Poisson distribution. Thus
B decides to create some fake observations, which he calls $z_1, z_2,\ldots,z_n$ (as
he knows they will probably not equal the original $x$-values) as follows. He
notes that the conditional probability of independent Poisson random variables
$Z_1, Z_2,\ldots, Z_n$ being equal to $z_1, z_2,\ldots,z_n$, given $\sum z_i = y_1$, is


$$\frac{\dfrac{\theta^{z_1}e^{-\theta}}{z_1!}\,\dfrac{\theta^{z_2}e^{-\theta}}{z_2!}\cdots\dfrac{\theta^{z_n}e^{-\theta}}{z_n!}}{\dfrac{(n\theta)^{y_1}e^{-n\theta}}{y_1!}}
= \frac{y_1!}{z_1!\,z_2!\cdots z_n!}\left(\frac{1}{n}\right)^{z_1}\left(\frac{1}{n}\right)^{z_2}\cdots\left(\frac{1}{n}\right)^{z_n},$$


since $Y_1 = \sum Z_i$ has a Poisson distribution with mean $n\theta$. The latter distribution
is multinomial with $y_1$ independent trials, each terminating in one of $n$
mutually exclusive and exhaustive ways, each of which has the same probability
$1/n$. Accordingly, B runs such a multinomial experiment of $y_1$ independent
trials and obtains $z_1, z_2,\ldots, z_n$. Find the likelihood function using these
$z$-values. Is it proportional to that of statistician A?

Hint: Here the likelihood function is the product of this conditional pdf and
the pdf of $Y_1 = \sum Z_i$.


7.2 A Sufficient Statistic for a Parameter 


Suppose that $X_1, X_2,\ldots,X_n$ is a random sample from a distribution that has pdf
$f(x;\theta)$, $\theta \in \Omega$. In Chapters 4 and 6, we constructed statistics to make statistical
inferences as illustrated by point and interval estimation and tests of statistical
hypotheses. We note that a statistic, for example, $Y = u(X_1, X_2,\ldots, X_n)$, is a form
of data reduction. To illustrate, instead of listing all of the individual observations
$X_1, X_2,\ldots,X_n$, we might prefer to give only the sample mean $\bar{X}$ or the sample
variance $S^2$. Thus statisticians look for ways of reducing a set of data so that these
data can be more easily understood without losing the meaning associated with the
entire set of observations.

It is interesting to note that a statistic $Y = u(X_1, X_2,\ldots,X_n)$ really partitions
the sample space of $X_1, X_2,\ldots,X_n$. For illustration, suppose we say that the sample
was observed and $\bar{x} = 8.32$. There are many points in the sample space which
have that same mean of 8.32, and we can consider them as belonging to the set
$\{(x_1,x_2,\ldots,x_n) : \bar{x} = 8.32\}$. As a matter of fact, all points on the hyperplane


$$x_1 + x_2 + \cdots + x_n = (8.32)n$$


yield the mean of $\bar{x} = 8.32$, so this hyperplane is the set. However, there are many
values that $\bar{X}$ can take, and thus there are many such sets. So, in this sense, the
sample mean $\bar{X}$, or any statistic $Y = u(X_1, X_2,\ldots,X_n)$, partitions the sample
space into a collection of sets.

Often in the study of statistics the parameter $\theta$ of the model is unknown; thus,
we need to make some statistical inference about it. In this section we consider a
statistic denoted by $Y_1 = u_1(X_1, X_2,\ldots, X_n)$, which we call a sufficient statistic
and which we find is good for making those inferences. This sufficient statistic
partitions the sample space in such a way that, given


$$(X_1, X_2,\ldots,X_n) \in \{(x_1, x_2,\ldots,x_n) : u_1(x_1, x_2,\ldots,x_n) = y_1\},$$

the conditional probability of $X_1, X_2,\ldots,X_n$ does not depend upon $\theta$. Intuitively,
this means that once the set determined by $Y_1 = y_1$ is fixed, the distribution of
another statistic, say $Y_2 = u_2(X_1, X_2,\ldots, X_n)$, does not depend upon the parameter
$\theta$ because the conditional distribution of $X_1, X_2,\ldots,X_n$ does not depend upon $\theta$.
Hence it is impossible to use $Y_2$, given $Y_1 = y_1$, to make a statistical inference about
$\theta$. So, in a sense, $Y_1$ exhausts all the information about $\theta$ that is contained in the
sample. This is why we call $Y_1 = u_1(X_1, X_2,\ldots, X_n)$ a sufficient statistic.

To understand clearly the definition of a sufficient statistic for a parameter $\theta$,
we start with an illustration.


Example 7.2.1. Let $X_1, X_2,\ldots, X_n$ denote a random sample from the distribution
that has pmf


$$f(x;\theta) = \begin{cases} \theta^x(1-\theta)^{1-x} & x = 0,1;\ 0 < \theta < 1 \\ 0 & \text{elsewhere.} \end{cases}$$


The statistic $Y_1 = X_1 + X_2 + \cdots + X_n$ has the pmf


$$f_{Y_1}(y_1;\theta) = \begin{cases} \binom{n}{y_1}\theta^{y_1}(1-\theta)^{n-y_1} & y_1 = 0,1,\ldots,n \\ 0 & \text{elsewhere.} \end{cases}$$


What is the conditional probability

$$P(X_1 = x_1, X_2 = x_2,\ldots,X_n = x_n\,|\,Y_1 = y_1) = P(A|B),$$


say, where $y_1 = 0,1,2,\ldots,n$? Unless the sum of the integers $x_1, x_2,\ldots,x_n$ (each of
which equals zero or 1) is equal to $y_1$, the conditional probability obviously equals
zero because $A \cap B = \emptyset$. But in the case $y_1 = \sum x_i$, we have that $A \subset B$, so that
$A \cap B = A$ and $P(A|B) = P(A)/P(B)$; thus, the conditional probability equals


$$\frac{\theta^{x_1}(1-\theta)^{1-x_1}\theta^{x_2}(1-\theta)^{1-x_2}\cdots\theta^{x_n}(1-\theta)^{1-x_n}}{\binom{n}{y_1}\theta^{y_1}(1-\theta)^{n-y_1}}
= \frac{\theta^{\sum x_i}(1-\theta)^{n-\sum x_i}}{\binom{n}{\sum x_i}\theta^{\sum x_i}(1-\theta)^{n-\sum x_i}}
= \frac{1}{\binom{n}{\sum x_i}}.$$


Since $y_1 = x_1 + x_2 + \cdots + x_n$ equals the number of ones in the $n$ independent trials,
this is the conditional probability of selecting a particular arrangement of $y_1$ ones
and $(n - y_1)$ zeros. Note that this conditional probability does not depend upon
the value of the parameter $\theta$. ■
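
The conclusion of Example 7.2.1 can be confirmed by direct computation. In the R sketch below, which is our own illustration, the function cond_prob (a hypothetical name) divides the joint pmf of one particular arrangement by the binomial pmf of its sum; the arrangement and the three parameter values are assumptions.

```r
# Conditional probability of one arrangement given the sum, as in Example 7.2.1.
cond_prob <- function(x, theta) {
  n <- length(x)
  y1 <- sum(x)
  joint <- prod(theta^x * (1 - theta)^(1 - x))     # P(X1 = x1, ..., Xn = xn)
  joint / dbinom(y1, size = n, prob = theta)       # divide by P(Y1 = y1)
}

x <- c(1, 0, 0, 1, 0)                  # illustrative arrangement (an assumption)
cond_values <- sapply(c(0.2, 0.5, 0.9), function(th) cond_prob(x, th))
theta_free  <- 1 / choose(length(x), sum(x))   # 1 / choose(n, y1)
```

Each entry of cond_values equals theta_free, whatever parameter value is used in the calculation.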


In general, let $f_{Y_1}(y_1;\theta)$ be the pmf of the statistic $Y_1 = u_1(X_1, X_2,\ldots,X_n)$,
where $X_1, X_2,\ldots, X_n$ is a random sample arising from a distribution of the discrete
type having pmf $f(x;\theta)$, $\theta \in \Omega$. The conditional probability of $X_1 = x_1, X_2 = x_2,\ldots,X_n = x_n$,
given $Y_1 = y_1$, equals


$$\frac{f(x_1;\theta)f(x_2;\theta)\cdots f(x_n;\theta)}{f_{Y_1}[u_1(x_1, x_2,\ldots,x_n);\theta]},$$

provided that $x_1, x_2,\ldots,x_n$ are such that the fixed $y_1 = u_1(x_1,x_2,\ldots,x_n)$, and
equals zero otherwise. We say that $Y_1 = u_1(X_1, X_2,\ldots,X_n)$ is a sufficient statistic
for $\theta$ if and only if this ratio does not depend upon $\theta$. While, with distributions of
the continuous type, we cannot use the same argument, we do, in this case, accept
the fact that if this ratio does not depend upon $\theta$, then the conditional distribution
of $X_1, X_2,\ldots,X_n$, given $Y_1 = y_1$, does not depend upon $\theta$. Thus, in both cases, we
use the same definition of a sufficient statistic for $\theta$.


Definition 7.2.1. Let $X_1, X_2,\ldots,X_n$ denote a random sample of size $n$ from a
distribution that has pdf or pmf $f(x;\theta)$, $\theta \in \Omega$. Let $Y_1 = u_1(X_1, X_2,\ldots,X_n)$ be a
statistic whose pdf or pmf is $f_{Y_1}(y_1;\theta)$. Then $Y_1$ is a sufficient statistic for $\theta$ if
and only if


$$\frac{f(x_1;\theta)f(x_2;\theta)\cdots f(x_n;\theta)}{f_{Y_1}[u_1(x_1, x_2,\ldots,x_n);\theta]} = H(x_1, x_2,\ldots,x_n),$$


where $H(x_1,x_2,\ldots,x_n)$ does not depend upon $\theta \in \Omega$.


Remark 7.2.1. In most cases in this book, $X_1, X_2,\ldots, X_n$ represent the observations
of a random sample; that is, they are iid. It is not necessary, however, in more
general situations, that these random variables be independent; as a matter of fact,
they do not need to be identically distributed. Thus, more generally, the definition
of sufficiency of a statistic $Y_1 = u_1(X_1, X_2,\ldots, X_n)$ would be extended to read that


$$\frac{f(x_1, x_2,\ldots,x_n;\theta)}{f_{Y_1}[u_1(x_1, x_2,\ldots,x_n);\theta]} = H(x_1, x_2,\ldots,x_n)$$


does not depend upon $\theta \in \Omega$, where $f(x_1,x_2,\ldots,x_n;\theta)$ is the joint pdf or pmf of
$X_1, X_2,\ldots,X_n$. There are even a few situations in which we need an extension like
this one in this book. ■


We now give two examples that are illustrative of the definition. 


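Example 7.2.2. Let $X_1, X_2,\ldots,X_n$ be a random sample from a gamma distribution
with $\alpha = 2$ and $\beta = \theta > 0$. Because the mgf associated with this distribution
is given by $M(t) = (1 - \theta t)^{-2}$, $t < 1/\theta$, the mgf of $Y_1 = \sum_{i=1}^{n} X_i$ is

$$E[e^{t(X_1+X_2+\cdots+X_n)}] = E(e^{tX_1})E(e^{tX_2})\cdots E(e^{tX_n}) = [(1 - \theta t)^{-2}]^n = (1 - \theta t)^{-2n}.$$

Thus $Y_1$ has a gamma distribution with $\alpha = 2n$ and $\beta = \theta$, so that its pdf is


$$f_{Y_1}(y_1;\theta) = \begin{cases} \dfrac{1}{\Gamma(2n)\theta^{2n}}\,y_1^{2n-1}e^{-y_1/\theta} & 0 < y_1 < \infty \\ 0 & \text{elsewhere.} \end{cases}$$


Thus we have

$$\frac{\left[\dfrac{x_1^{2-1}e^{-x_1/\theta}}{\Gamma(2)\theta^2}\right]\left[\dfrac{x_2^{2-1}e^{-x_2/\theta}}{\Gamma(2)\theta^2}\right]\cdots\left[\dfrac{x_n^{2-1}e^{-x_n/\theta}}{\Gamma(2)\theta^2}\right]}{\dfrac{(x_1+x_2+\cdots+x_n)^{2n-1}e^{-(x_1+x_2+\cdots+x_n)/\theta}}{\Gamma(2n)\theta^{2n}}}
= \frac{\Gamma(2n)}{[\Gamma(2)]^n}\,\frac{x_1x_2\cdots x_n}{(x_1+x_2+\cdots+x_n)^{2n-1}},$$

where $0 < x_i < \infty$, $i = 1,2,\ldots,n$. Since this ratio does not depend upon $\theta$, the
sum $Y_1$ is a sufficient statistic for $\theta$. ■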
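
The ratio in Example 7.2.2 can also be evaluated numerically. The R sketch below is our own illustration: the function sufficiency_ratio (a hypothetical name) forms the ratio with dgamma at several parameter values and compares it with the closed form obtained in the example. The sample values and the parameter values are assumptions.

```r
# Ratio of the joint pdf to the pdf of Y1 = sum(x) for Example 7.2.2.
# In dgamma(), shape plays the role of alpha and scale the role of beta.
sufficiency_ratio <- function(x, theta) {
  n <- length(x)
  joint <- prod(dgamma(x, shape = 2, scale = theta))
  joint / dgamma(sum(x), shape = 2 * n, scale = theta)
}

x <- c(0.8, 2.5, 1.3, 4.1)             # illustrative sample (an assumption)
ratios <- sapply(c(0.5, 1, 3), function(th) sufficiency_ratio(x, th))

# Closed form: Gamma(2n) * prod(x) / (Gamma(2)^n * sum(x)^(2n - 1))
n <- length(x)
closed_form <- gamma(2 * n) * prod(x) / (gamma(2)^n * sum(x)^(2 * n - 1))
```

The three entries of ratios agree with closed_form up to rounding, so the ratio is free of the parameter for this sample.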


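Example 7.2.3. Let $Y_1 < Y_2 < \cdots < Y_n$ denote the order statistics of a random
sample of size $n$ from the distribution with pdf


$$f(x;\theta) = e^{-(x-\theta)}I_{(\theta,\infty)}(x).$$

Here we use the indicator function of a set $A$ defined by


$$I_A(x) = \begin{cases} 1 & x \in A \\ 0 & x \notin A. \end{cases}$$

This means, of course, that $f(x;\theta) = e^{-(x-\theta)}$, $\theta < x < \infty$, zero elsewhere. The pdf
of $Y_1 = \min(X_i)$ is

$$f_{Y_1}(y_1;\theta) = ne^{-n(y_1-\theta)}I_{(\theta,\infty)}(y_1).$$


Note that $\theta < \min\{x_i\}$ if and only if $\theta < x_i$, for all $i = 1,\ldots,n$. Notationally this
can be expressed as $I_{(\theta,\infty)}(\min x_i) = \prod_{i=1}^{n} I_{(\theta,\infty)}(x_i)$. Thus we have that


$$\frac{\prod_{i=1}^{n} e^{-(x_i-\theta)}I_{(\theta,\infty)}(x_i)}{ne^{-n(\min x_i-\theta)}I_{(\theta,\infty)}(\min x_i)}
= \frac{e^{-x_1-x_2-\cdots-x_n}}{ne^{-n\min x_i}}.$$


Since this ratio does not depend upon $\theta$, the first order statistic $Y_1$ is a sufficient
statistic for $\theta$. ■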


If we are to show by means of the definition that a certain statistic $Y_1$ is or is not
a sufficient statistic for a parameter $\theta$, we must first of all know the pdf of $Y_1$, say
$f_{Y_1}(y_1;\theta)$. In many instances it may be quite difficult to find this pdf. Fortunately,
this problem can be avoided if we prove the following factorization theorem of
Neyman.


Theorem 7.2.1 (Neyman). Let $X_1, X_2,\ldots,X_n$ denote a random sample from a
distribution that has pdf or pmf $f(x;\theta)$, $\theta \in \Omega$. The statistic $Y_1 = u_1(X_1,\ldots,X_n)$
is a sufficient statistic for $\theta$ if and only if we can find two nonnegative functions,
$k_1$ and $k_2$, such that


$$f(x_1;\theta)f(x_2;\theta)\cdots f(x_n;\theta) = k_1[u_1(x_1, x_2,\ldots,x_n);\theta]\,k_2(x_1, x_2,\ldots,x_n), \tag{7.2.1}$$

where $k_2(x_1,x_2,\ldots,x_n)$ does not depend upon $\theta$.


Proof. We shall prove the theorem when the random variables are of the continuous
type. Assume that the factorization is as stated in the theorem. In our
proof we shall make the one-to-one transformation $y_1 = u_1(x_1,x_2,\ldots,x_n)$, $y_2 = u_2(x_1,x_2,\ldots,x_n)$,
$\ldots$, $y_n = u_n(x_1,x_2,\ldots,x_n)$ having the inverse functions $x_1 = w_1(y_1, y_2,\ldots, y_n)$,
$x_2 = w_2(y_1,y_2,\ldots, y_n)$, $\ldots$, $x_n = w_n(y_1, y_2,\ldots,y_n)$ and Jacobian
$J$; see the note after the proof. The pdf of the statistics $Y_1, Y_2,\ldots,Y_n$ is then
given by

$$g(y_1, y_2,\ldots, y_n;\theta) = k_1(y_1;\theta)\,k_2(w_1, w_2,\ldots,w_n)\,|J|,$$

where $w_i = w_i(y_1,y_2,\ldots,y_n)$, $i = 1,2,\ldots,n$. The pdf of $Y_1$, say $f_{Y_1}(y_1;\theta)$, is given
by


$$f_{Y_1}(y_1;\theta) = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} g(y_1, y_2,\ldots, y_n;\theta)\,dy_2\cdots dy_n
= k_1(y_1;\theta)\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} |J|\,k_2(w_1, w_2,\ldots,w_n)\,dy_2\cdots dy_n.$$


Now the function $k_2$ does not depend upon $\theta$, nor is $\theta$ involved in either the Jacobian
$J$ or the limits of integration. Hence the $(n - 1)$-fold integral in the right-hand
member of the preceding equation is a function of $y_1$ alone, for example, $m(y_1)$.
Thus


$$f_{Y_1}(y_1;\theta) = k_1(y_1;\theta)\,m(y_1).$$


If $m(y_1) = 0$, then $f_{Y_1}(y_1;\theta) = 0$. If $m(y_1) > 0$, we can write


$$k_1[u_1(x_1, x_2,\ldots,x_n);\theta] = \frac{f_{Y_1}[u_1(x_1,\ldots,x_n);\theta]}{m[u_1(x_1,\ldots,x_n)]},$$

and the assumed factorization becomes

$$f(x_1;\theta)\cdots f(x_n;\theta) = f_{Y_1}[u_1(x_1,\ldots,x_n);\theta]\,\frac{k_2(x_1,\ldots,x_n)}{m[u_1(x_1,\ldots,x_n)]}.$$


Since neither the function $k_2$ nor the function $m$ depends upon $\theta$, then in accordance
with the definition, $Y_1$ is a sufficient statistic for the parameter $\theta$.

Conversely, if $Y_1$ is a sufficient statistic for $\theta$, the factorization can be realized by
taking the function $k_1$ to be the pdf of $Y_1$, namely, the function $f_{Y_1}$. This completes
the proof of the theorem. ■


Note that the assumption of a one-to-one transformation made in the proof is not
needed; see Lehmann (1986) for a more rigorous proof. This theorem characterizes
sufficiency and, as the following examples show, is usually much easier to work with
than the definition of sufficiency.


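Example 7.2.4. Let $X_1, X_2,\ldots,X_n$ denote a random sample from a distribution
that is $N(\theta,\sigma^2)$, $-\infty < \theta < \infty$, where the variance $\sigma^2 > 0$ is known. If
$\bar{x} = \sum_{i=1}^{n} x_i/n$, then


$$\sum_{i=1}^{n}(x_i-\theta)^2 = \sum_{i=1}^{n}[(x_i-\bar{x}) + (\bar{x}-\theta)]^2 = \sum_{i=1}^{n}(x_i-\bar{x})^2 + n(\bar{x}-\theta)^2$$


because

$$2\sum_{i=1}^{n}(x_i-\bar{x})(\bar{x}-\theta) = 2(\bar{x}-\theta)\sum_{i=1}^{n}(x_i-\bar{x}) = 0.$$

Thus the joint pdf of $X_1, X_2,\ldots,X_n$ may be written

$$\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n\exp\left[-\sum_{i=1}^{n}(x_i-\theta)^2/2\sigma^2\right]
= \left\{\exp[-n(\bar{x}-\theta)^2/2\sigma^2]\right\}\left\{\frac{\exp\left[-\sum_{i=1}^{n}(x_i-\bar{x})^2/2\sigma^2\right]}{(\sigma\sqrt{2\pi})^n}\right\}.$$


Because the first factor of the right-hand member of this equation depends upon
$x_1,x_2,\ldots,x_n$ only through $\bar{x}$, and the second factor does not depend upon $\theta$, the
factorization theorem implies that the mean $\bar{X}$ of the sample is, for any particular
value of $\sigma^2$, a sufficient statistic for $\theta$, the mean of the normal distribution. ■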


We could have used the definition in the preceding example because we know
that $\bar{X}$ is $N(\theta,\sigma^2/n)$. Let us now consider an example in which the use of the
definition is inappropriate.


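Example 7.2.5. Let $X_1, X_2,\ldots,X_n$ denote a random sample from a distribution
with pdf


$$f(x;\theta) = \begin{cases} \theta x^{\theta-1} & 0 < x < 1 \\ 0 & \text{elsewhere,} \end{cases}$$


where $0 < \theta$. The joint pdf of $X_1, X_2,\ldots, X_n$ is


$$\theta^n\left(\prod_{i=1}^{n} x_i\right)^{\theta-1} = \left[\theta^n\left(\prod_{i=1}^{n} x_i\right)^{\theta}\right]\left(\frac{1}{\prod_{i=1}^{n} x_i}\right),$$


where $0 < x_i < 1$, $i = 1,2,\ldots,n$. In the factorization theorem, let

$$k_1[u_1(x_1,x_2,\ldots,x_n);\theta] = \theta^n\left(\prod_{i=1}^{n} x_i\right)^{\theta}$$

and

$$k_2(x_1,x_2,\ldots,x_n) = \frac{1}{\prod_{i=1}^{n} x_i}.$$

Since $k_2(x_1,x_2,\ldots,x_n)$ does not depend upon $\theta$, the product $\prod_{i=1}^{n} X_i$ is a sufficient
statistic for $\theta$. ■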
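
A small numerical experiment illustrates what the factorization in Example 7.2.5 implies. In the R sketch below, which is our own illustration, two hypothetical samples with different values but the same product (0.09) are compared; the function name joint_pdf, the samples, and the grid of parameter values are assumptions. Because both factors of the factorization depend on the data only through the product, the two joint pdfs coincide at every parameter value.

```r
# Joint pdf of a sample from f(x; theta) = theta * x^(theta - 1), 0 < x < 1.
joint_pdf <- function(x, theta) prod(theta * x^(theta - 1))

# Two hypothetical samples with the same product (0.09) but different values.
x_a <- c(0.2, 0.5, 0.9)
x_b <- c(0.3, 0.6, 0.5)

theta_grid <- c(0.5, 1, 2, 5)
pdf_a <- sapply(theta_grid, function(th) joint_pdf(x_a, th))
pdf_b <- sapply(theta_grid, function(th) joint_pdf(x_b, th))
```

Any inference based on the likelihood is therefore the same for the two samples, which is the practical meaning of sufficiency of the product.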


There is a tendency for some readers to apply incorrectly the factorization theorem
in those instances in which the domain of positive probability density depends
upon the parameter $\theta$. This is due to the fact that they do not give proper consideration
to the domain of the function $k_2(x_1,x_2,\ldots,x_n)$. This is illustrated in the
next example.
Example 7.2.6. In Example 7.2.3 with $f(x;\theta) = e^{-(x-\theta)}I_{(\theta,\infty)}(x)$, it was found
that the first order statistic $Y_1$ is a sufficient statistic for $\theta$. To illustrate our point
about not considering the domain of the function, take $n = 3$ and note that


$$e^{-(x_1-\theta)}e^{-(x_2-\theta)}e^{-(x_3-\theta)} = \left[e^{-3\max x_i + 3\theta}\right]\left[e^{-x_1-x_2-x_3+3\max x_i}\right]$$


or a similar expression. Certainly, in the latter formula, there is no $\theta$ in the second
factor and it might be assumed that $Y_3 = \max X_i$ is a sufficient statistic for $\theta$. Of
course, this is incorrect because we should have written the joint pdf of $X_1, X_2, X_3$
as

$$\prod_{i=1}^{3}\left[e^{-(x_i-\theta)}I_{(\theta,\infty)}(x_i)\right] = \left[e^{3\theta}I_{(\theta,\infty)}(\min x_i)\right]\left[\exp\left\{-\sum_{i=1}^{3} x_i\right\}\right]$$

because $I_{(\theta,\infty)}(\min x_i) = I_{(\theta,\infty)}(x_1)I_{(\theta,\infty)}(x_2)I_{(\theta,\infty)}(x_3)$. A similar statement cannot
be made with $\max x_i$. Thus $Y_1 = \min X_i$ is the sufficient statistic for $\theta$, not
$Y_3 = \max X_i$.
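
The failure of the maximum to be sufficient can be seen numerically. The R sketch below is our own illustration: it uses two hypothetical samples of size 3 with the same maximum and the same sum but different minima (the samples, the two parameter values, and the name joint_pdf are assumptions) and evaluates the joint pdf, including the indicator, at each parameter value.

```r
# Joint pdf for f(x; theta) = exp(-(x - theta)) on x > theta, zero elsewhere.
joint_pdf <- function(x, theta) prod(exp(-(x - theta)) * (x > theta))

# Two hypothetical samples: same maximum (5) and sum (10), different minimum.
x_a <- c(2, 3, 5)   # min = 2
x_b <- c(1, 4, 5)   # min = 1

pdf_a_low  <- joint_pdf(x_a, 0.5)   # theta below both minima
pdf_b_low  <- joint_pdf(x_b, 0.5)
pdf_a_high <- joint_pdf(x_a, 1.5)   # theta between the two minima
pdf_b_high <- joint_pdf(x_b, 1.5)
```

At the smaller parameter value the two joint pdfs agree, but at the larger one the second sample has joint pdf zero while the first does not, so samples sharing a maximum need not carry the same information about the parameter.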


EXERCISES 


7.2.1. Let $X_1, X_2,\ldots,X_n$ be iid $N(0,\theta)$, $0 < \theta < \infty$. Show that $\sum_{i=1}^{n} X_i^2$ is a
sufficient statistic for $\theta$.


7.2.2. Prove that the sum of the observations of a random sample of size $n$ from a
Poisson distribution having parameter $\theta$, $0 < \theta < \infty$, is a sufficient statistic for $\theta$.


7.2.3. Show that the $n$th order statistic of a random sample of size $n$ from the
uniform distribution having pdf $f(x;\theta) = 1/\theta$, $0 < x < \theta$, $0 < \theta < \infty$, zero
elsewhere, is a sufficient statistic for $\theta$. Generalize this result by considering the pdf
$f(x;\theta) = Q(\theta)M(x)$, $0 < x < \theta$, $0 < \theta < \infty$, zero elsewhere. Here, of course,


$$\int_0^{\theta} M(x)\,dx = \frac{1}{Q(\theta)}.$$


7.2.4. Let $X_1, X_2,\ldots,X_n$ be a random sample of size $n$ from a geometric distribution
that has pmf $f(x;\theta) = (1 - \theta)^x\theta$, $x = 0,1,2,\ldots$, $0 < \theta < 1$, zero elsewhere.
Show that $\sum_{i=1}^{n} X_i$ is a sufficient statistic for $\theta$.


7.2.5. Show that the sum of the observations of a random sample of size $n$ from
a gamma distribution that has pdf $f(x;\theta) = (1/\theta)e^{-x/\theta}$, $0 < x < \infty$, $0 < \theta < \infty$,
zero elsewhere, is a sufficient statistic for $\theta$.


7.2.6. Let X1, Xo,...,X, be a random sample of size n from a beta distribution 
with parameters a = 6 and @ = 5. Show that the product X;X2--: Xv, is a sufficient 
statistic for 0. 


7.2.7. Show that the product of the sample observations is a sufficient statistic for 


# > 0 if the random sample is taken from a gamma distribution with parameters 
a=@and G=6. 
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7.2.8. What is the sufficient statistic for 0 if the sample arises from a beta distri- 
bution in which a= 6 =6@> 0? 


7.2.9. We consider a random sample X 1, X2,...,X, from a distribution with pdf 
f(a; 0) = (1/0) exp(—2/0), 0 < a < oo, zero elsewhere, where 0 < 6. Possibly, in a 
life-testing situation, however, we only observe the first 7 order statistics Yj < Yo < 
ree ct Ye. 


(a) Record the joint pdf of these order statistics and denote it by L(6). 
(b) Under these conditions, find the mle, 6, by maximizing L(6). 
(c) Find the mgf and pdf of 6. 


(d) With a slight extension of the definition of sufficiency, is 6 a sufficient statistic? 


7.3 Properties of a Sufficient Statistic 


Suppose $X_1, X_2, \ldots, X_n$ is a random sample on a random variable with pdf or pmf $f(x;\theta)$, where $\theta \in \Omega$. In this section we discuss how sufficiency is used to determine MVUEs. First note that a sufficient estimate is not unique in any sense. For if $Y_1 = u_1(X_1, X_2, \ldots, X_n)$ is a sufficient statistic and $Y_2 = g(Y_1)$ is a statistic, where $g(x)$ is a one-to-one function, then

$$\begin{aligned} f(x_1;\theta) f(x_2;\theta) \cdots f(x_n;\theta) &= k_1[u_1(y_1); \theta]\, k_2(x_1, x_2, \ldots, x_n) \\ &= k_1[u_1(g^{-1}(y_2)); \theta]\, k_2(x_1, x_2, \ldots, x_n); \end{aligned}$$

hence, by the factorization theorem, $Y_2$ is also sufficient. However, as the theorem below shows, sufficiency can lead to a best point estimate.


We first refer back to Theorem 2.3.1 of Section 2.3: If $X_1$ and $X_2$ are random variables such that the variance of $X_2$ exists, then

$$E[X_2] = E[E(X_2|X_1)]$$

and

$$\text{Var}(X_2) \ge \text{Var}[E(X_2|X_1)].$$

For the adaptation in the context of sufficient statistics, we let the sufficient statistic $Y_1$ be $X_1$ and $Y_2$, an unbiased statistic of $\theta$, be $X_2$. Thus, with $E(Y_2|y_1) = \varphi(y_1)$, we have

$$\theta = E(Y_2) = E[\varphi(Y_1)]$$

and

$$\text{Var}(Y_2) \ge \text{Var}[\varphi(Y_1)].$$

That is, through this conditioning, the function $\varphi(Y_1)$ of the sufficient statistic $Y_1$ is an unbiased estimator of $\theta$ having a variance no larger than that of the unbiased estimator $Y_2$. We summarize this discussion more formally in the following theorem, which can be attributed to Rao and Blackwell.


Theorem 7.3.1 (Rao-Blackwell). Let $X_1, X_2, \ldots, X_n$, $n$ a fixed positive integer, denote a random sample from a distribution (continuous or discrete) that has pdf or pmf $f(x;\theta)$, $\theta \in \Omega$. Let $Y_1 = u_1(X_1, X_2, \ldots, X_n)$ be a sufficient statistic for $\theta$, and let $Y_2 = u_2(X_1, X_2, \ldots, X_n)$, not a function of $Y_1$ alone, be an unbiased estimator of $\theta$. Then $E(Y_2|y_1) = \varphi(y_1)$ defines a statistic $\varphi(Y_1)$. This statistic $\varphi(Y_1)$ is a function of the sufficient statistic for $\theta$; it is an unbiased estimator of $\theta$; and its variance is less than or equal to that of $Y_2$.


This theorem tells us that in our search for an MVUE of a parameter, we may, if a sufficient statistic for the parameter exists, restrict that search to functions of the sufficient statistic. For if we begin with an unbiased estimator $Y_2$ alone, then we can always improve on this by computing $E(Y_2|y_1) = \varphi(y_1)$ so that $\varphi(Y_1)$ is an unbiased estimator with a smaller variance than that of $Y_2$.
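The improvement promised by the theorem can be seen in a small simulation. The following Python sketch is an illustration under assumptions of our own choosing: an exponential distribution with mean $\theta = 2$, sample size $n = 5$, the crude unbiased estimator $Y_2 = X_1$, and the sufficient statistic $Y_1 = \sum X_i$. By symmetry, $E(X_1 | y_1) = y_1/n$, so $\varphi(Y_1) = \overline{X}$.

```python
import random
import statistics

random.seed(7)
theta, n, reps = 2.0, 5, 20000

crude, improved = [], []
for _ in range(reps):
    xs = [random.expovariate(1.0 / theta) for _ in range(n)]  # exponential with mean theta
    crude.append(xs[0])            # Y2 = X1: unbiased, but uses a single observation
    improved.append(sum(xs) / n)   # phi(Y1) = E(X1 | Y1) = Y1 / n, by symmetry

mean_crude = statistics.fmean(crude)
mean_improved = statistics.fmean(improved)
var_crude = statistics.variance(crude)        # theory: theta^2
var_improved = statistics.variance(improved)  # theory: theta^2 / n
```

Both sample means should be close to $\theta$, while the sample variance of the conditioned estimator should be roughly one fifth of that of $X_1$.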

After Theorem 7.3.1, many students believe that it is necessary to find first some unbiased estimator $Y_2$ in their search for $\varphi(Y_1)$, an unbiased estimator of $\theta$ based upon the sufficient statistic $Y_1$. This is not the case at all, and Theorem 7.3.1 simply convinces us that we can restrict our search for a best estimator to functions of $Y_1$. Furthermore, there is a connection between sufficient statistics and maximum likelihood estimates, as shown in the following theorem:


Theorem 7.3.2. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution that has pdf or pmf $f(x;\theta)$, $\theta \in \Omega$. If a sufficient statistic $Y_1 = u_1(X_1, X_2, \ldots, X_n)$ for $\theta$ exists and if a maximum likelihood estimator $\hat{\theta}$ of $\theta$ also exists uniquely, then $\hat{\theta}$ is a function of $Y_1 = u_1(X_1, X_2, \ldots, X_n)$.


Proof. Let $f_{Y_1}(y_1;\theta)$ be the pdf or pmf of $Y_1$. Then by the definition of sufficiency, the likelihood function

$$\begin{aligned} L(\theta; x_1, x_2, \ldots, x_n) &= f(x_1;\theta) f(x_2;\theta) \cdots f(x_n;\theta) \\ &= f_{Y_1}[u_1(x_1, x_2, \ldots, x_n); \theta]\, H(x_1, x_2, \ldots, x_n), \end{aligned}$$

where $H(x_1, x_2, \ldots, x_n)$ does not depend upon $\theta$. Thus $L$ and $f_{Y_1}$, as functions of $\theta$, are maximized simultaneously. Since there is one and only one value of $\theta$ that maximizes $L$ and hence $f_{Y_1}[u_1(x_1, x_2, \ldots, x_n); \theta]$, that value of $\theta$ must be a function of $u_1(x_1, x_2, \ldots, x_n)$. Thus the mle $\hat{\theta}$ is a function of the sufficient statistic $Y_1 = u_1(X_1, X_2, \ldots, X_n)$. ■


We know from Chapters 4 and 6 that, generally, mles are asymptotically unbiased estimators of $\theta$. Hence, one way to proceed is to find a sufficient statistic and then find the mle. Based on this, we can often obtain an unbiased estimator that is a function of the sufficient statistic. This process is illustrated in the following example.

Example 7.3.1. Let $X_1, \ldots, X_n$ be iid with pdf

$$f(x;\theta) = \begin{cases} \theta e^{-\theta x} & 0 < x < \infty,\ \theta > 0 \\ 0 & \text{elsewhere.} \end{cases}$$

Suppose we want an MVUE of $\theta$. The joint pdf (likelihood function) is

$$L(\theta; x_1, \ldots, x_n) = \theta^n e^{-\theta \sum_{i=1}^n x_i}, \quad \text{for } x_i > 0,\ i = 1, \ldots, n.$$

Hence, by the factorization theorem, the statistic $Y_1 = \sum_{i=1}^n X_i$ is sufficient. The log of the likelihood function is

$$l(\theta) = n \log\theta - \theta \sum_{i=1}^n x_i.$$

Taking the partial with respect to $\theta$ of $l(\theta)$ and setting it to 0 results in the mle of $\theta$, which is given by

$$Y_2 = \frac{1}{\overline{X}}.$$

Note that $Y_2 = n/Y_1$ is a function of the sufficient statistic $Y_1$. Also, since $Y_2$ is the mle of $\theta$, it is asymptotically unbiased. Hence, as a first step, we shall determine its expectation. In this problem, $X_i$ are iid $\Gamma(1, 1/\theta)$ random variables; hence, $Y_1 = \sum_{i=1}^n X_i$ is $\Gamma(n, 1/\theta)$. Therefore,

$$E(Y_2) = E\left[\frac{1}{\overline{X}}\right] = nE\left[\frac{1}{\sum_{i=1}^n X_i}\right] = n\int_0^\infty \frac{\theta^n}{\Gamma(n)}\, t^{-1} t^{n-1} e^{-\theta t}\,dt;$$

making the change of variable $z = \theta t$ and simplifying results in

$$E(Y_2) = E\left[\frac{1}{\overline{X}}\right] = \theta\,\frac{n}{(n-1)!}\,\Gamma(n-1) = \theta\,\frac{n}{n-1}.$$

Thus the statistic $[(n-1)Y_2]/n = (n-1)/\sum_{i=1}^n X_i$ is an MVUE of $\theta$. ■
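The bias of the mle and its correction can be checked by simulation. The sketch below is only an illustration; the values $\theta = 3$, $n = 5$, the seed, and the number of replications are arbitrary assumptions. It draws samples from the pdf $\theta e^{-\theta x}$ and averages both $n/Y_1$ and $(n-1)/Y_1$.

```python
import random
import statistics

random.seed(3)
theta, n, reps = 3.0, 5, 40000

mle_vals, mvue_vals = [], []
for _ in range(reps):
    # Y1 = sum of n draws from the pdf theta * exp(-theta * x); expovariate takes the rate
    y1 = sum(random.expovariate(theta) for _ in range(n))
    mle_vals.append(n / y1)          # Y2 = 1 / Xbar, the mle
    mvue_vals.append((n - 1) / y1)   # (n - 1) / Y1, the bias-corrected statistic

mean_mle = statistics.fmean(mle_vals)     # theory: theta * n / (n - 1) = 3.75
mean_mvue = statistics.fmean(mvue_vals)   # theory: theta = 3
```

The first average should settle near $\theta n/(n-1)$ and the second near $\theta$, in agreement with the expectation computed above.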


In the next two sections, we discover that, in most instances, if there is one function $\varphi(Y_1)$ that is unbiased, $\varphi(Y_1)$ is the only unbiased estimator based on the sufficient statistic $Y_1$.


Remark 7.3.1. Since the unbiased estimator $\varphi(Y_1)$, where $\varphi(Y_1) = E(Y_2|y_1)$, has a variance smaller than that of the unbiased estimator $Y_2$ of $\theta$, students sometimes reason as follows. Let the function $\Upsilon(y_3) = E[\varphi(Y_1)|Y_3 = y_3]$, where $Y_3$ is another statistic, which is not sufficient for $\theta$. By the Rao-Blackwell theorem, we have $E[\Upsilon(Y_3)] = \theta$ and $\Upsilon(Y_3)$ has a smaller variance than does $\varphi(Y_1)$. Accordingly, $\Upsilon(Y_3)$ must be better than $\varphi(Y_1)$ as an unbiased estimator of $\theta$. But this is not true, because $Y_3$ is not sufficient; thus, $\theta$ is present in the conditional distribution of $Y_1$, given $Y_3 = y_3$, and the conditional mean $\Upsilon(y_3)$. So although indeed $E[\Upsilon(Y_3)] = \theta$, $\Upsilon(Y_3)$ is not even a statistic because it involves the unknown parameter $\theta$ and hence cannot be used as an estimate. ■

We illustrate this remark in the following example.


Example 7.3.2. Let $X_1, X_2, X_3$ be a random sample from an exponential distribution with mean $\theta > 0$, so that the joint pdf is

$$\left(\frac{1}{\theta}\right)^3 e^{-(x_1+x_2+x_3)/\theta}, \quad 0 < x_i < \infty,$$

$i = 1, 2, 3$, zero elsewhere. From the factorization theorem, we see that $Y_1 = X_1 + X_2 + X_3$ is a sufficient statistic for $\theta$. Of course,

$$E(Y_1) = E(X_1 + X_2 + X_3) = 3\theta,$$

and thus $Y_1/3 = \overline{X}$ is a function of the sufficient statistic that is an unbiased estimator of $\theta$.

In addition, let $Y_2 = X_2 + X_3$ and $Y_3 = X_3$. The one-to-one transformation defined by

$$x_1 = y_1 - y_2, \quad x_2 = y_2 - y_3, \quad x_3 = y_3$$

has Jacobian equal to 1 and the joint pdf of $Y_1, Y_2, Y_3$ is

$$g(y_1, y_2, y_3; \theta) = \left(\frac{1}{\theta}\right)^3 e^{-y_1/\theta}, \quad 0 < y_3 < y_2 < y_1 < \infty,$$

zero elsewhere. The marginal pdf of $Y_1$ and $Y_3$ is found by integrating out $y_2$ to obtain

$$g_{13}(y_1, y_3; \theta) = \left(\frac{1}{\theta}\right)^3 (y_1 - y_3) e^{-y_1/\theta}, \quad 0 < y_3 < y_1 < \infty,$$

zero elsewhere. The pdf of $Y_3$ alone is

$$g_3(y_3; \theta) = \frac{1}{\theta} e^{-y_3/\theta}, \quad 0 < y_3 < \infty,$$

zero elsewhere, since $Y_3 = X_3$ is an observation of a random sample from this exponential distribution.

Accordingly, the conditional pdf of $Y_1$, given $Y_3 = y_3$, is

$$g_{1|3}(y_1|y_3) = \frac{g_{13}(y_1, y_3; \theta)}{g_3(y_3; \theta)} = \left(\frac{1}{\theta}\right)^2 (y_1 - y_3) e^{-(y_1 - y_3)/\theta}, \quad 0 < y_3 < y_1 < \infty,$$

zero elsewhere. Thus

$$\begin{aligned} E\left(\frac{Y_1}{3}\,\Big|\,y_3\right) &= E\left(\frac{Y_1 - Y_3}{3}\,\Big|\,y_3\right) + E\left(\frac{Y_3}{3}\,\Big|\,y_3\right) \\ &= \frac{1}{3} \int_{y_3}^{\infty} \left(\frac{1}{\theta}\right)^2 (y_1 - y_3)^2 e^{-(y_1 - y_3)/\theta}\,dy_1 + \frac{y_3}{3} \\ &= \frac{1}{3}\,\frac{\Gamma(3)\theta^3}{\theta^2} + \frac{y_3}{3} = \frac{2\theta}{3} + \frac{y_3}{3} = \Upsilon(y_3). \end{aligned}$$

Of course, $E[\Upsilon(Y_3)] = \theta$ and $\text{var}[\Upsilon(Y_3)] \le \text{var}(Y_1/3)$, but $\Upsilon(Y_3)$ is not a statistic, as it involves $\theta$ and cannot be used as an estimator of $\theta$. This illustrates the preceding remark.
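A short simulation makes the point concrete. In the Python sketch below, which assumes $\theta = 2$ purely for illustration, the quantity $\Upsilon(Y_3) = 2\theta/3 + Y_3/3$ can be computed only because the simulation knows the true $\theta$; with real data that value would be unavailable.

```python
import random
import statistics

random.seed(11)
theta, reps = 2.0, 20000

xbar_vals, upsilon_vals = [], []
for _ in range(reps):
    x1, x2, x3 = (random.expovariate(1.0 / theta) for _ in range(3))  # mean theta
    xbar_vals.append((x1 + x2 + x3) / 3.0)             # Y1 / 3, a genuine statistic
    upsilon_vals.append(2.0 * theta / 3.0 + x3 / 3.0)  # Upsilon(Y3): needs the unknown theta

mean_xbar = statistics.fmean(xbar_vals)
mean_upsilon = statistics.fmean(upsilon_vals)
var_xbar = statistics.variance(xbar_vals)        # theory: theta^2 / 3
var_upsilon = statistics.variance(upsilon_vals)  # theory: theta^2 / 9
```

Both averages should be near $\theta$, and $\Upsilon(Y_3)$ shows the smaller variance, yet the line that computes it uses `theta` explicitly, which is exactly why it is not a statistic.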


EXERCISES 


7.3.1. In each of Exercises 7.2.1-7.2.4, show that the mle of $\theta$ is a function of the sufficient statistic for $\theta$.


7.3.2. Let $Y_1 < Y_2 < Y_3 < Y_4 < Y_5$ be the order statistics of a random sample of size 5 from the uniform distribution having pdf $f(x;\theta) = 1/\theta$, $0 < x < \theta$, $0 < \theta < \infty$, zero elsewhere. Show that $2Y_3$ is an unbiased estimator of $\theta$. Determine the joint pdf of $Y_3$ and the sufficient statistic $Y_5$ for $\theta$. Find the conditional expectation $E(2Y_3|y_5) = \varphi(y_5)$. Compare the variances of $2Y_3$ and $\varphi(Y_5)$.


7.3.3. If $X_1, X_2$ is a random sample of size 2 from a distribution having pdf $f(x;\theta) = (1/\theta)e^{-x/\theta}$, $0 < x < \infty$, $0 < \theta < \infty$, zero elsewhere, find the joint pdf of the sufficient statistic $Y_1 = X_1 + X_2$ for $\theta$ and $Y_2 = X_2$. Show that $Y_2$ is an unbiased estimator of $\theta$ with variance $\theta^2$. Find $E(Y_2|y_1) = \varphi(y_1)$ and the variance of $\varphi(Y_1)$.


7.3.4. Let $f(x, y) = (2/\theta^2)e^{-(x+y)/\theta}$, $0 < x < y < \infty$, zero elsewhere, be the joint pdf of the random variables $X$ and $Y$.


(a) Show that the mean and the variance of $Y$ are, respectively, $3\theta/2$ and $5\theta^2/4$.


(b) Show that $E(Y|x) = x + \theta$. In accordance with the theory, the expected value of $X + \theta$ is that of $Y$, namely, $3\theta/2$, and the variance of $X + \theta$ is less than that of $Y$. Show that the variance of $X + \theta$ is in fact $\theta^2/4$.


7.3.5. In each of Exercises 7.2.1-7.2.3, compute the expected value of the given sufficient statistic and, in each case, determine an unbiased estimator of $\theta$ that is a function of that sufficient statistic alone.


7.3.6. Let $X_1, X_2, \ldots, X_n$ be a random sample from a Poisson distribution with mean $\theta$. Find the conditional expectation $E\left(X_1 + 2X_2 + 3X_3 \,\big|\, \sum_{i=1}^n X_i\right)$.


7.4 Completeness and Uniqueness


Let $X_1, X_2, \ldots, X_n$ be a random sample from the Poisson distribution that has pmf

$$f(x;\theta) = \begin{cases} \dfrac{\theta^x e^{-\theta}}{x!} & x = 0, 1, 2, \ldots;\ \theta > 0 \\ 0 & \text{elsewhere.} \end{cases}$$

From Exercise 7.2.2, we know that $Y_1 = \sum_{i=1}^n X_i$ is a sufficient statistic for $\theta$ and its pmf is

$$g_1(y_1;\theta) = \begin{cases} \dfrac{(n\theta)^{y_1} e^{-n\theta}}{y_1!} & y_1 = 0, 1, 2, \ldots \\ 0 & \text{elsewhere.} \end{cases}$$

Let us consider the family $\{g_1(y_1;\theta) : \theta > 0\}$ of probability mass functions. Suppose that the function $u(Y_1)$ of $Y_1$ is such that $E[u(Y_1)] = 0$ for every $\theta > 0$. We shall show that this requires $u(y_1)$ to be zero at every point $y_1 = 0, 1, 2, \ldots$. That is, $E[u(Y_1)] = 0$ for $\theta > 0$ requires

$$0 = u(0) = u(1) = u(2) = u(3) = \cdots.$$

We have for all $\theta > 0$ that

$$\begin{aligned} 0 = E[u(Y_1)] &= \sum_{y_1=0}^{\infty} u(y_1)\,\frac{(n\theta)^{y_1} e^{-n\theta}}{y_1!} \\ &= e^{-n\theta}\left[u(0) + u(1)\frac{n\theta}{1!} + u(2)\frac{(n\theta)^2}{2!} + \cdots\right]. \end{aligned}$$

Since $e^{-n\theta}$ does not equal zero, we have shown that

$$0 = u(0) + [n u(1)]\theta + \left[\frac{n^2 u(2)}{2}\right]\theta^2 + \cdots.$$

However, if such an infinite (power) series converges to zero for all $\theta > 0$, then each of the coefficients must equal zero. That is,

$$u(0) = 0, \quad n u(1) = 0, \quad \frac{n^2 u(2)}{2} = 0, \ldots,$$

and thus $0 = u(0) = u(1) = u(2) = \cdots$, as we wanted to show. Of course, the condition $E[u(Y_1)] = 0$ for all $\theta > 0$ does not place any restriction on $u(y_1)$ when $y_1$ is not a nonnegative integer. So we see that, in this illustration, $E[u(Y_1)] = 0$ for all $\theta > 0$ requires that $u(y_1)$ equals zero except on a set of points that has probability zero for each pmf $g_1(y_1;\theta)$, $0 < \theta$. From the following definition we observe that the family $\{g_1(y_1;\theta) : 0 < \theta\}$ is complete.


Definition 7.4.1. Let the random variable $Z$ of either the continuous type or the discrete type have a pdf or pmf that is one member of the family $\{h(z;\theta) : \theta \in \Omega\}$. If the condition $E[u(Z)] = 0$, for every $\theta \in \Omega$, requires that $u(z)$ be zero except on a set of points that has probability zero for each $h(z;\theta)$, $\theta \in \Omega$, then the family $\{h(z;\theta) : \theta \in \Omega\}$ is called a complete family of probability density or mass functions.


Remark 7.4.1. In Section 1.8, it was noted that the existence of $E[u(X)]$ implies that the integral (or sum) converges absolutely. This absolute convergence was tacitly assumed in our definition of completeness and it is needed to prove that certain families of probability density functions are complete. ■


In order to show that certain families of probability density functions of the continuous type are complete, we must appeal to the same type of theorem in analysis that we used when we claimed that the moment generating function uniquely determines a distribution. This is illustrated in the next example.

Example 7.4.1. Consider the family of pdfs $\{h(z;\theta) : 0 < \theta < \infty\}$. Suppose $Z$ has a pdf in this family given by

$$h(z;\theta) = \begin{cases} \dfrac{1}{\theta} e^{-z/\theta} & 0 < z < \infty \\ 0 & \text{elsewhere.} \end{cases}$$

Let us say that $E[u(Z)] = 0$ for every $\theta > 0$. That is,

$$\frac{1}{\theta} \int_0^\infty u(z) e^{-z/\theta}\,dz = 0, \quad \theta > 0.$$

Readers acquainted with the theory of transformations recognize the integral in the left-hand member as being essentially the Laplace transform of $u(z)$. In that theory we learn that the only function $u(z)$ transforming to a function of $\theta$ that is identically equal to zero is $u(z) = 0$, except (in our terminology) on a set of points that has probability zero for each $h(z;\theta)$, $\theta > 0$. That is, the family $\{h(z;\theta) : 0 < \theta < \infty\}$ is complete.


Let the parameter $\theta$ in the pdf or pmf $f(x;\theta)$, $\theta \in \Omega$, have a sufficient statistic $Y_1 = u_1(X_1, X_2, \ldots, X_n)$, where $X_1, X_2, \ldots, X_n$ is a random sample from this distribution. Let the pdf or pmf of $Y_1$ be $f_{Y_1}(y_1;\theta)$, $\theta \in \Omega$. It has been seen that if there is any unbiased estimator $Y_2$ (not a function of $Y_1$ alone) of $\theta$, then there is at least one function of $Y_1$ that is an unbiased estimator of $\theta$, and our search for a best estimator of $\theta$ may be restricted to functions of $Y_1$. Suppose it has been verified that a certain function $\varphi(Y_1)$, not a function of $\theta$, is such that $E[\varphi(Y_1)] = \theta$ for all values of $\theta$, $\theta \in \Omega$. Let $\psi(Y_1)$ be another function of the sufficient statistic $Y_1$ alone, so that we also have $E[\psi(Y_1)] = \theta$ for all values of $\theta$, $\theta \in \Omega$. Hence

$$E[\varphi(Y_1) - \psi(Y_1)] = 0, \quad \theta \in \Omega.$$

If the family $\{f_{Y_1}(y_1;\theta) : \theta \in \Omega\}$ is complete, then $\varphi(y_1) - \psi(y_1) = 0$, except on a set of points that has probability zero. That is, for every other unbiased estimator $\psi(Y_1)$ of $\theta$, we have

$$\varphi(y_1) = \psi(y_1)$$

except possibly at certain special points. Thus, in this sense [namely $\varphi(y_1) = \psi(y_1)$, except on a set of points with probability zero], $\varphi(Y_1)$ is the unique function of $Y_1$ that is an unbiased estimator of $\theta$. In accordance with the Rao-Blackwell theorem, $\varphi(Y_1)$ has a smaller variance than every other unbiased estimator of $\theta$. That is, the statistic $\varphi(Y_1)$ is the MVUE of $\theta$. This fact is stated in the following theorem of Lehmann and Scheffé.


Theorem 7.4.1 (Lehmann and Scheffé). Let $X_1, X_2, \ldots, X_n$, $n$ a fixed positive integer, denote a random sample from a distribution that has pdf or pmf $f(x;\theta)$, $\theta \in \Omega$, let $Y_1 = u_1(X_1, X_2, \ldots, X_n)$ be a sufficient statistic for $\theta$, and let the family $\{f_{Y_1}(y_1;\theta) : \theta \in \Omega\}$ be complete. If there is a function of $Y_1$ that is an unbiased estimator of $\theta$, then this function of $Y_1$ is the unique MVUE of $\theta$. Here "unique" is used in the sense described in the preceding paragraph.

The statement that $Y_1$ is a sufficient statistic for a parameter $\theta$, $\theta \in \Omega$, and that the family $\{f_{Y_1}(y_1;\theta) : \theta \in \Omega\}$ of probability density functions is complete is lengthy and somewhat awkward. We shall adopt the less descriptive, but more convenient, terminology that $Y_1$ is a complete sufficient statistic for $\theta$. In the next section, we study a fairly large class of probability density functions for which a complete sufficient statistic $Y_1$ for $\theta$ can be determined by inspection.


Example 7.4.2 (Uniform Distribution). Let $X_1, X_2, \ldots, X_n$ be a random sample from the uniform distribution with pdf $f(x;\theta) = 1/\theta$, $0 < x < \theta$, $\theta > 0$, and zero elsewhere. As Exercise 7.2.3 shows, $Y_n = \max\{X_1, X_2, \ldots, X_n\}$ is a sufficient statistic for $\theta$. It is easy to show that the pdf of $Y_n$ is

$$g(y_n;\theta) = \begin{cases} \dfrac{n y_n^{n-1}}{\theta^n} & 0 < y_n < \theta \\ 0 & \text{elsewhere.} \end{cases} \tag{7.4.1}$$

To show that $Y_n$ is complete, suppose for any function $u(t)$ and any $\theta$ that $E[u(Y_n)] = 0$; i.e.,

$$0 = \int_0^\theta u(t)\,\frac{n t^{n-1}}{\theta^n}\,dt.$$

Since $\theta > 0$, this equation is equivalent to

$$0 = \int_0^\theta u(t)\,t^{n-1}\,dt.$$

Taking partial derivatives of both sides with respect to $\theta$ and using the Fundamental Theorem of Calculus, we have

$$0 = u(\theta)\theta^{n-1}.$$

Since $\theta > 0$, $u(\theta) = 0$, for all $\theta > 0$. Thus $Y_n$ is a complete and sufficient statistic for $\theta$. It is easy to show that

$$E(Y_n) = \int_0^\theta y\,\frac{n y^{n-1}}{\theta^n}\,dy = \frac{n}{n+1}\,\theta.$$

Therefore, the MVUE of $\theta$ is $((n+1)/n)Y_n$. ■
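To see the minimum variance property at work, the following Python sketch compares $((n+1)/n)Y_n$ with another unbiased estimator of $\theta$, namely $2\overline{X}$. The competitor $2\overline{X}$ and the settings $\theta = 5$, $n = 10$ are our own assumptions, chosen only for illustration.

```python
import random
import statistics

random.seed(5)
theta, n, reps = 5.0, 10, 20000

mvue_vals, moment_vals = [], []
for _ in range(reps):
    xs = [random.uniform(0.0, theta) for _ in range(n)]
    mvue_vals.append((n + 1) / n * max(xs))   # ((n + 1) / n) * Y_n
    moment_vals.append(2.0 * sum(xs) / n)     # 2 * Xbar, also unbiased for theta

mean_mvue = statistics.fmean(mvue_vals)
mean_moment = statistics.fmean(moment_vals)
var_mvue = statistics.variance(mvue_vals)      # theory: theta^2 / (n (n + 2))
var_moment = statistics.variance(moment_vals)  # theory: theta^2 / (3 n)
```

Both averages should be close to $\theta$, with the estimator based on $Y_n$ showing a markedly smaller sample variance.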


EXERCISES 


7.4.1. If $az^2 + bz + c = 0$ for more than two values of $z$, then $a = b = c = 0$. Use this result to show that the family $\{b(2, \theta) : 0 < \theta < 1\}$ is complete.


7.4.2. Show that each of the following families is not complete by finding at least one nonzero function $u(x)$ such that $E[u(X)] = 0$, for all $\theta > 0$.


(a)
$$f(x;\theta) = \begin{cases} \dfrac{1}{2\theta} & -\theta < x < \theta, \text{ where } 0 < \theta < \infty \\ 0 & \text{elsewhere.} \end{cases}$$

(b) $N(0, \theta)$, where $0 < \theta < \infty$.


7.4.3. Let $X_1, X_2, \ldots, X_n$ represent a random sample from the discrete distribution having the pmf

$$f(x;\theta) = \begin{cases} \theta^x (1-\theta)^{1-x} & x = 0, 1,\ 0 < \theta < 1 \\ 0 & \text{elsewhere.} \end{cases}$$

Show that $Y_1 = \sum_{i=1}^n X_i$ is a complete sufficient statistic for $\theta$. Find the unique function of $Y_1$ that is the MVUE of $\theta$.

Hint: Display $E[u(Y_1)] = 0$, show that the constant term $u(0)$ is equal to zero, divide both members of the equation by $\theta \ne 0$, and repeat the argument.


7.4.4. Consider the family of probability density functions $\{h(z;\theta) : \theta \in \Omega\}$, where $h(z;\theta) = 1/\theta$, $0 < z < \theta$, zero elsewhere.


(a) Show that the family is complete provided that $\Omega = \{\theta : 0 < \theta < \infty\}$.
Hint: For convenience, assume that $u(z)$ is continuous and note that the derivative of $E[u(Z)]$ with respect to $\theta$ is equal to zero also.


(b) Show that this family is not complete if $\Omega = \{\theta : 1 < \theta < \infty\}$.
Hint: Concentrate on the interval $0 < z < 1$ and find a nonzero function $u(z)$ on that interval such that $E[u(Z)] = 0$ for all $\theta > 1$.


7.4.5. Show that the first order statistic $Y_1$ of a random sample of size $n$ from the distribution having pdf $f(x;\theta) = e^{-(x-\theta)}$, $\theta < x < \infty$, $-\infty < \theta < \infty$, zero elsewhere, is a complete sufficient statistic for $\theta$. Find the unique function of this statistic which is the MVUE of $\theta$.


7.4.6. Let a random sample of size $n$ be taken from a distribution of the discrete type with pmf $f(x;\theta) = 1/\theta$, $x = 1, 2, \ldots, \theta$, zero elsewhere, where $\theta$ is an unknown positive integer.


(a) Show that the largest observation, say $Y$, of the sample is a complete sufficient statistic for $\theta$.


(b) Prove that

$$\frac{Y^{n+1} - (Y-1)^{n+1}}{Y^n - (Y-1)^n}$$

is the unique MVUE of $\theta$.


7.4.7. Let $X$ have the pdf $f_X(x;\theta) = 1/(2\theta)$, for $-\theta < x < \theta$, zero elsewhere, where $\theta > 0$.


(a) Is the statistic $Y = |X|$ a sufficient statistic for $\theta$? Why?


(b) Let $f_Y(y;\theta)$ be the pdf of $Y$. Is the family $\{f_Y(y;\theta) : \theta > 0\}$ complete? Why?


7.4.8. Let $X$ have the pmf $p(x;\theta) = \frac{1}{2}\binom{n}{|x|}\theta^{|x|}(1-\theta)^{n-|x|}$, for $x = \pm 1, \pm 2, \ldots, \pm n$, $p(0, \theta) = (1-\theta)^n$, and zero elsewhere, where $0 < \theta < 1$.


(a) Show that this family $\{p(x;\theta) : 0 < \theta < 1\}$ is not complete.

(b) Let $Y = |X|$. Show that $Y$ is a complete and sufficient statistic for $\theta$.


7.4.9. Let $X_1, \ldots, X_n$ be iid with pdf $f(x;\theta) = 1/(3\theta)$, $-\theta < x < 2\theta$, zero elsewhere, where $\theta > 0$.


(a) Find the mle $\hat{\theta}$ of $\theta$.

(b) Is $\hat{\theta}$ a sufficient statistic for $\theta$? Why?

(c) Is $(n+1)\hat{\theta}/n$ the unique MVUE of $\theta$? Why?


7.4.10. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample of size $n$ from a distribution with pdf $f(x;\theta) = 1/\theta$, $0 < x < \theta$, zero elsewhere. By Example 7.4.2, the statistic $Y_n$ is a complete sufficient statistic for $\theta$ and it has pdf

$$g(y_n;\theta) = \frac{n y_n^{n-1}}{\theta^n}, \quad 0 < y_n < \theta,$$

and zero elsewhere.

(a) Find the distribution function $H_n(z;\theta)$ of $Z = n(\theta - Y_n)$.

(b) Find the $\lim_{n\to\infty} H_n(z;\theta)$ and thus the limiting distribution of $Z$.


7.5 The Exponential Class of Distributions 


In this section we discuss an important class of distributions, called the exponential 
class. As we show, this class possesses complete and sufficient statistics which are 
readily determined from the distribution. 

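Consider a family $\{f(x;\theta) : \theta \in \Omega\}$ of probability density or mass functions, where $\Omega$ is the interval set $\Omega = \{\theta : \gamma < \theta < \delta\}$, where $\gamma$ and $\delta$ are known constants (they may be $\pm\infty$), and where

$$f(x;\theta) = \begin{cases} \exp[p(\theta)K(x) + H(x) + q(\theta)] & x \in \mathcal{S} \\ 0 & \text{elsewhere,} \end{cases} \tag{7.5.1}$$

where $\mathcal{S}$ is the support of $X$. In this section we are concerned with a particular class of the family called the regular exponential class.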


Definition 7.5.1 (Regular Exponential Class). A pdf of the form (7.5.1) is said to be a member of the regular exponential class of probability density or mass functions if


1. $\mathcal{S}$, the support of $X$, does not depend upon $\theta$
2. $p(\theta)$ is a nontrivial continuous function of $\theta \in \Omega$
3. Finally,


(a) if $X$ is a continuous random variable, then each of $K'(x) \not\equiv 0$ and $H(x)$ is a continuous function of $x \in \mathcal{S}$,


(b) if $X$ is a discrete random variable, then $K(x)$ is a nontrivial function of $x \in \mathcal{S}$.

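For example, each member of the family $\{f(x;\theta) : 0 < \theta < \infty\}$, where $f(x;\theta)$ is $N(0, \theta)$, represents a regular case of the exponential class of the continuous type because

$$\begin{aligned} f(x;\theta) &= \frac{1}{\sqrt{2\pi\theta}}\, e^{-x^2/2\theta} \\ &= \exp\left(-\frac{1}{2\theta}x^2 - \log\sqrt{2\pi\theta}\right), \quad -\infty < x < \infty. \end{aligned}$$

On the other hand, consider the uniform density function given by

$$f(x;\theta) = \begin{cases} \exp\{-\log\theta\} & x \in (0, \theta) \\ 0 & \text{elsewhere.} \end{cases}$$

This can be written in the form (7.5.1), but the support is the interval $(0, \theta)$, which depends on $\theta$. Hence the uniform family is not a regular exponential family.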

Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution that represents a regular case of the exponential class. The joint pdf or pmf of $X_1, X_2, \ldots, X_n$ is

$$\exp\left[p(\theta)\sum_{i=1}^n K(x_i) + \sum_{i=1}^n H(x_i) + nq(\theta)\right]$$

for $x_i \in \mathcal{S}$, $i = 1, 2, \ldots, n$ and zero elsewhere. At points in the $\mathcal{S}$ of $X$, this joint pdf or pmf may be written as the product of the two nonnegative functions

$$\exp\left[p(\theta)\sum_{i=1}^n K(x_i) + nq(\theta)\right] \exp\left[\sum_{i=1}^n H(x_i)\right].$$

In accordance with the factorization theorem, Theorem 7.2.1, $Y_1 = \sum_{i=1}^n K(X_i)$ is a sufficient statistic for the parameter $\theta$.

Besides the fact that $Y_1$ is a sufficient statistic, we can obtain the general form of the distribution of $Y_1$ and its mean and variance. We summarize these results in a theorem. The details of the proof are given in Exercises 7.5.5 and 7.5.8. Exercise 7.5.6 obtains the mgf of $Y_1$ in the case that $p(\theta) = \theta$.


Theorem 7.5.1. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution that represents a regular case of the exponential class, with pdf or pmf given by (7.5.1). Consider the statistic $Y_1 = \sum_{i=1}^n K(X_i)$. Then


1. The pdf or pmf of $Y_1$ has the form

$$f_{Y_1}(y_1;\theta) = R(y_1)\exp[p(\theta)y_1 + nq(\theta)], \tag{7.5.2}$$

for $y_1 \in \mathcal{S}_{Y_1}$, and some function $R(y_1)$. Neither $\mathcal{S}_{Y_1}$ nor $R(y_1)$ depends on $\theta$.


2. $E(Y_1) = -n\,\dfrac{q'(\theta)}{p'(\theta)}$.

3. $\text{Var}(Y_1) = n\,\dfrac{1}{p'(\theta)^3}\left\{p''(\theta)q'(\theta) - q''(\theta)p'(\theta)\right\}$.


Example 7.5.1. Let $X$ have a Poisson distribution with parameter $\theta \in (0, \infty)$. Then the support of $X$ is the set $\mathcal{S} = \{0, 1, 2, \ldots\}$, which does not depend on $\theta$. Further, the pmf of $X$ on its support is

$$f(x, \theta) = e^{-\theta}\frac{\theta^x}{x!} = \exp\{(\log\theta)x + \log(1/x!) + (-\theta)\}.$$

Hence the Poisson distribution is a member of the regular exponential class, with $p(\theta) = \log(\theta)$, $q(\theta) = -\theta$, and $K(x) = x$. Therefore, if $X_1, X_2, \ldots, X_n$ denotes a random sample on $X$, then the statistic $Y_1 = \sum_{i=1}^n X_i$ is sufficient. But since $p'(\theta) = 1/\theta$ and $q'(\theta) = -1$, Theorem 7.5.1 verifies that the mean of $Y_1$ is $n\theta$. It is easy to verify that the variance of $Y_1$ is $n\theta$ also. Finally, we can show that the function $R(y_1)$ in Theorem 7.5.1 is given by $R(y_1) = n^{y_1}(1/y_1!)$. ■
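The mean and variance formulas of Theorem 7.5.1 can also be evaluated numerically for the Poisson case. The sketch below is only a rough check: the helper names `derivs` and `mean_var_from_theorem`, the step size, and the values $\theta = 2.5$, $n = 8$ are assumptions made for illustration, and derivatives are approximated by central finite differences rather than computed exactly.

```python
import math

def derivs(f, t, h=1e-4):
    """Central finite-difference approximations to f'(t) and f''(t)."""
    d1 = (f(t + h) - f(t - h)) / (2 * h)
    d2 = (f(t + h) - 2 * f(t) + f(t - h)) / (h * h)
    return d1, d2

def mean_var_from_theorem(p, q, theta, n):
    """E(Y1) = -n q'/p' and Var(Y1) = n (p'' q' - q'' p') / p'^3, as in Theorem 7.5.1."""
    p1, p2 = derivs(p, theta)
    q1, q2 = derivs(q, theta)
    return -n * q1 / p1, n * (p2 * q1 - q2 * p1) / p1 ** 3

# Poisson case: p(theta) = log(theta), q(theta) = -theta.
theta, n = 2.5, 8
mean_y1, var_y1 = mean_var_from_theorem(math.log, lambda t: -t, theta, n)
```

Up to finite-difference error, both `mean_y1` and `var_y1` should agree with $n\theta$.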


For the regular case of the exponential class, we have shown that the statistic $Y_1 = \sum_{i=1}^n K(X_i)$ is sufficient for $\theta$. We now use the form of the pdf of $Y_1$ given in Theorem 7.5.1 to establish the completeness of $Y_1$.


Theorem 7.5.2. Let $f(x;\theta)$, $\gamma < \theta < \delta$, be a pdf or pmf of a random variable $X$ whose distribution is a regular case of the exponential class. Then if $X_1, X_2, \ldots, X_n$ (where $n$ is a fixed positive integer) is a random sample from the distribution of $X$, the statistic $Y_1 = \sum_{i=1}^n K(X_i)$ is a sufficient statistic for $\theta$ and the family $\{f_{Y_1}(y_1;\theta) : \gamma < \theta < \delta\}$ of probability density functions of $Y_1$ is complete. That is, $Y_1$ is a complete sufficient statistic for $\theta$.


Proof: We have shown above that $Y_1$ is sufficient. For completeness, suppose that $E[u(Y_1)] = 0$. Expression (7.5.2) of Theorem 7.5.1 gives the pdf of $Y_1$. Hence we have the equation

$$\int_{\mathcal{S}_{Y_1}} u(y_1)R(y_1)\exp\{p(\theta)y_1 + nq(\theta)\}\,dy_1 = 0$$

or equivalently since $\exp\{nq(\theta)\} \ne 0$,

$$\int_{\mathcal{S}_{Y_1}} u(y_1)R(y_1)\exp\{p(\theta)y_1\}\,dy_1 = 0$$

for all $\theta$. However, $p(\theta)$ is a nontrivial continuous function of $\theta$, and thus this integral is essentially a type of Laplace transform of $u(y_1)R(y_1)$. The only function of $y_1$ transforming to the 0 function is the zero function (except for a set of points with probability zero in our context). That is,

$$u(y_1)R(y_1) \equiv 0.$$

However, $R(y_1) \ne 0$ for all $y_1 \in \mathcal{S}_{Y_1}$ because it is a factor in the pdf of $Y_1$. Hence $u(y_1) \equiv 0$ (except for a set of points with probability zero). Therefore, $Y_1$ is a complete sufficient statistic for $\theta$. ■

This theorem has useful implications. In a regular case of form (7.5.1), we can see by inspection that the sufficient statistic is $Y_1 = \sum_{i=1}^n K(X_i)$. If we can see how to form a function of $Y_1$, say $\varphi(Y_1)$, so that $E[\varphi(Y_1)] = \theta$, then the statistic $\varphi(Y_1)$ is unique and is the MVUE of $\theta$.


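Example 7.5.2. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a normal distribution that has pdf

$$f(x;\theta) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{(x-\theta)^2}{2\sigma^2}\right], \quad -\infty < x < \infty,\ -\infty < \theta < \infty,$$

or

$$f(x;\theta) = \exp\left(\frac{\theta}{\sigma^2}x - \frac{x^2}{2\sigma^2} - \log\sqrt{2\pi\sigma^2} - \frac{\theta^2}{2\sigma^2}\right).$$

Here $\sigma^2$ is any fixed positive number. This is a regular case of the exponential class with

$$p(\theta) = \frac{\theta}{\sigma^2}, \quad K(x) = x,$$

$$H(x) = -\frac{x^2}{2\sigma^2} - \log\sqrt{2\pi\sigma^2}, \quad q(\theta) = -\frac{\theta^2}{2\sigma^2}.$$

Accordingly, $Y_1 = X_1 + X_2 + \cdots + X_n = n\overline{X}$ is a complete sufficient statistic for the mean $\theta$ of a normal distribution for every fixed value of the variance $\sigma^2$. Since $E(Y_1) = n\theta$, then $\varphi(Y_1) = Y_1/n = \overline{X}$ is the only function of $Y_1$ that is an unbiased estimator of $\theta$; and being a function of the sufficient statistic $Y_1$, it has a minimum variance. That is, $\overline{X}$ is the unique MVUE of $\theta$. Incidentally, since $Y_1$ is a one-to-one function of $\overline{X}$, $\overline{X}$ itself is also a complete sufficient statistic for $\theta$.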
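As a quick consistency check, offered here as a worked illustration rather than as part of the original example, the mean and variance formulas of Theorem 7.5.1 can be applied to the functions $p(\theta)$ and $q(\theta)$ identified above. They return the familiar mean $n\theta$ and variance $n\sigma^2$ of the sum of $n$ iid $N(\theta, \sigma^2)$ variables.

```latex
\begin{aligned}
p'(\theta) &= \frac{1}{\sigma^2}, \qquad p''(\theta) = 0, \qquad
q'(\theta) = -\frac{\theta}{\sigma^2}, \qquad q''(\theta) = -\frac{1}{\sigma^2}, \\
E(Y_1) &= -n\,\frac{q'(\theta)}{p'(\theta)}
        = -n\,\frac{-\theta/\sigma^2}{1/\sigma^2} = n\theta, \\
\operatorname{Var}(Y_1) &= n\,\frac{1}{p'(\theta)^3}
        \left\{p''(\theta)q'(\theta) - q''(\theta)p'(\theta)\right\}
        = n\sigma^6\left\{0 + \frac{1}{\sigma^2}\cdot\frac{1}{\sigma^2}\right\} = n\sigma^2.
\end{aligned}
```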


Example 7.5.3 (Example 7.5.1, Continued). Reconsider the discussion concerning the Poisson distribution with parameter $\theta$ found in Example 7.5.1. Based on this discussion, the statistic $Y_1 = \sum_{i=1}^n X_i$ was sufficient. It follows from Theorem 7.5.2 that its family of distributions is complete. Since $E(Y_1) = n\theta$, it follows that $\overline{X} = n^{-1}Y_1$ is the unique MVUE of $\theta$. ■


EXERCISES 
7.5.1. Write the pdf

$$f(x;\theta) = \frac{1}{6\theta^4}\,x^3 e^{-x/\theta}, \quad 0 < x < \infty,\ 0 < \theta < \infty,$$

zero elsewhere, in the exponential form. If $X_1, X_2, \ldots, X_n$ is a random sample from this distribution, find a complete sufficient statistic $Y_1$ for $\theta$ and the unique function $\varphi(Y_1)$ of this statistic that is the MVUE of $\theta$. Is $\varphi(Y_1)$ itself a complete sufficient statistic?


7.5.2. Let $X_1, X_2, \ldots, X_n$ denote a random sample of size $n > 1$ from a distribution with pdf $f(x;\theta) = \theta e^{-\theta x}$, $0 < x < \infty$, zero elsewhere, and $\theta > 0$. Then $Y = \sum_{i=1}^n X_i$ is a sufficient statistic for $\theta$. Prove that $(n-1)/Y$ is the MVUE of $\theta$.


7.5.3. Let $X_1, X_2, \ldots, X_n$ denote a random sample of size $n$ from a distribution with pdf $f(x;\theta) = \theta x^{\theta-1}$, $0 < x < 1$, zero elsewhere, and $\theta > 0$.


(a) Show that the geometric mean $(X_1 X_2 \cdots X_n)^{1/n}$ of the sample is a complete sufficient statistic for $\theta$.


(b) Find the maximum likelihood estimator of $\theta$, and observe that it is a function of this geometric mean.


7.5.4. Let $\overline{X}$ denote the mean of the random sample $X_1, X_2, \ldots, X_n$ from a gamma-type distribution with parameters $\alpha > 0$ and $\beta = \theta > 0$. Compute $E[X_1|\overline{x}]$.

Hint: Can you find directly a function $\psi(\overline{X})$ of $\overline{X}$ such that $E[\psi(\overline{X})] = \theta$? Is $E(X_1|\overline{x}) = \psi(\overline{x})$? Why?


7.5.5. Let $X$ be a random variable with the pdf of a regular case of the exponential class, given by $f(x;\theta) = \exp[\theta K(x) + H(x) + q(\theta)]$, $a < x < b$, $\gamma < \theta < \delta$. Show that $E[K(X)] = -q'(\theta)/p'(\theta)$, provided these derivatives exist, by differentiating both members of the equality

$$\int_a^b \exp[p(\theta)K(x) + H(x) + q(\theta)]\,dx = 1$$

with respect to $\theta$. By a second differentiation, find the variance of $K(X)$.


7.5.6. Given that $f(x;\theta) = \exp[\theta K(x) + H(x) + q(\theta)]$, $a < x < b$, $\gamma < \theta < \delta$, represents a regular case of the exponential class, show that the moment-generating function $M(t)$ of $Y = K(X)$ is $M(t) = \exp[q(\theta) - q(\theta + t)]$, $\gamma < \theta + t < \delta$.


7.5.7. In the preceding exercise, given that $E(Y) = E[K(X)] = \theta$, prove that $Y$ is $N(\theta, 1)$.
Hint: Consider $M'(0) = \theta$ and solve the resulting differential equation.


7.5.8. If $X_1, X_2, \ldots, X_n$ is a random sample from a distribution that has a pdf which is a regular case of the exponential class, show that the pdf of $Y_1 = \sum_{i=1}^n K(X_i)$ is of the form $f_{Y_1}(y_1;\theta) = R(y_1)\exp[p(\theta)y_1 + nq(\theta)]$.

Hint: Let $Y_2 = X_2, \ldots, Y_n = X_n$ be $n - 1$ auxiliary random variables. Find the joint pdf of $Y_1, Y_2, \ldots, Y_n$ and then the marginal pdf of $Y_1$.


7.5.9. Let $Y$ denote the median and let $\overline{X}$ denote the mean of a random sample of size $n = 2k + 1$ from a distribution that is $N(\mu, \sigma^2)$. Compute $E(Y|\overline{X} = \overline{x})$.
Hint: See Exercise 7.5.4.


7.5.10. Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution with pdf $f(x;\theta) = \theta^2 x e^{-\theta x}$, $0 < x < \infty$, where $\theta > 0$.


(a) Argue that $Y = \sum_{i=1}^n X_i$ is a complete sufficient statistic for $\theta$.

(b) Compute $E(1/Y)$ and find the function of $Y$ that is the unique MVUE of $\theta$.


7.5.11. Let $X_1, X_2, \ldots, X_n$, $n > 2$, be a random sample from the binomial distribution $b(1, \theta)$.

(a) Show that $Y_1 = X_1 + X_2 + \cdots + X_n$ is a complete sufficient statistic for $\theta$.

(b) Find the function $\varphi(Y_1)$ that is the MVUE of $\theta$.

(c) Let $Y_2 = (X_1 + X_2)/2$ and compute $E(Y_2)$.

(d) Determine $E(Y_2|Y_1 = y_1)$.


7.5.12. Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution with pmf $p(x;\theta) = \theta^x(1-\theta)$, $x = 0, 1, 2, \ldots$, zero elsewhere, where $0 < \theta < 1$.


(a) Find the mle, $\hat{\theta}$, of $\theta$.

(b) Show that $\sum_{i=1}^n X_i$ is a complete sufficient statistic for $\theta$.

(c) Determine the MVUE of $\theta$.


7.6 Functions of a Parameter 


Up to this point we have sought an MVUE of a parameter $\theta$. Not always, however, are we interested in $\theta$ but rather in a function of $\theta$. There are several techniques we can use to find the MVUE. One is by inspection of the expected value of a sufficient statistic. This is how we found the MVUEs in Examples 7.5.2 and 7.5.3 of the last section. In this section and its exercises, we offer more examples of the inspection technique. The second technique is based on the conditional expectation of an unbiased estimate given a sufficient statistic. The third example illustrates this technique.

Recall that in Chapter 6 under regularity conditions, we obtained the asymptotic 
distribution theory for maximum likelihood estimators (mles). This allows certain 
asymptotic inferences (confidence intervals and tests) for these estimators. Such 
a straightforward theory is not available for MVUEs. As Theorem 7.3.2 shows, 
though, sometimes we can determine the relationship between the mle and the 
MVUE. In these situations, we can often obtain the asymptotic distribution for the 
MVUE based on the asymptotic distribution of the mle. Also, as we discuss in 
Section 7.6.1, we can usually make use of the bootstrap to obtain standard errors 
for MVUE estimates. We illustrate this for some of the following examples. 


Example 7.6.1. Let $X_1, X_2, \ldots, X_n$ denote the observations of a random sample
of size $n > 1$ from a distribution that is $b(1, \theta)$, $0 < \theta < 1$. We know that if
$Y = \sum_{i=1}^n X_i$, then $Y/n$ is the unique minimum variance unbiased estimator of $\theta$.
Now suppose we want to estimate the variance of $Y/n$, which is $\theta(1 - \theta)/n$. Let
$\delta = \theta(1 - \theta)$. Because $Y$ is a sufficient statistic for $\theta$, it is known that we can restrict
our search to functions of $Y$. The maximum likelihood estimate of $\delta$, which is given
by $\tilde{\delta} = (Y/n)(1 - Y/n)$, is a function of the sufficient statistic and seems to be a
reasonable starting point. The expectation of this statistic is given by

$$E[\tilde{\delta}] = E\left[\frac{Y}{n}\left(1 - \frac{Y}{n}\right)\right] = \frac{1}{n}E(Y) - \frac{1}{n^2}E(Y^2).$$

Now $E(Y) = n\theta$ and $E(Y^2) = n\theta(1 - \theta) + n^2\theta^2$. Hence

$$E\left[\frac{Y}{n}\left(1 - \frac{Y}{n}\right)\right] = (n - 1)\frac{\theta(1 - \theta)}{n}.$$

If we multiply both members of this equation by $n/(n - 1)$, we find that the statistic
$\hat{\delta} = (n/(n - 1))(Y/n)(1 - Y/n) = (n/(n - 1))\tilde{\delta}$ is the unique MVUE of $\delta$. Hence
the MVUE of $\delta/n$, the variance of $Y/n$, is $\hat{\delta}/n$.

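It is interesting to compare the mle $\tilde{\delta}$ with $\hat{\delta}$. Recall from Chapter 6 that the
mle $\tilde{\delta}$ is a consistent estimate of $\delta$ and that $\sqrt{n}(\tilde{\delta} - \delta)$ is asymptotically normal.
Because

$$\hat{\delta} - \tilde{\delta} = \tilde{\delta}\,\frac{1}{n - 1} \xrightarrow{P} \delta \cdot 0 = 0,$$

it follows that $\hat{\delta}$ is also a consistent estimator of $\delta$. Further,

$$\sqrt{n}(\hat{\delta} - \delta) - \sqrt{n}(\tilde{\delta} - \delta) = \frac{\sqrt{n}}{n - 1}\,\tilde{\delta} \xrightarrow{P} 0. \qquad (7.6.1)$$

Hence $\sqrt{n}(\hat{\delta} - \delta)$ has the same asymptotic distribution as $\sqrt{n}(\tilde{\delta} - \delta)$. Using the
$\Delta$-method, Theorem 5.2.9, we can obtain the asymptotic distribution of $\sqrt{n}(\tilde{\delta} - \delta)$.
Let $g(\theta) = \theta(1 - \theta)$. Then $g'(\theta) = 1 - 2\theta$. Hence, by Theorem 5.2.9 and (7.6.1), the
asymptotic distribution of $\sqrt{n}(\hat{\delta} - \delta)$ is given by

$$\sqrt{n}(\hat{\delta} - \delta) \xrightarrow{D} N(0, \theta(1 - \theta)(1 - 2\theta)^2),$$
provided $\theta \neq 1/2$; see Exercise 7.6.12 for the case $\theta = 1/2$. ■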
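The bias calculation in Example 7.6.1 can be checked exactly, without simulation, by summing over the binomial pmf of $Y$. The following is a minimal sketch; the values of n and theta and all variable names are our own illustrative choices, not part of the text.

```r
# Exact expectations of the mle and the MVUE of delta = theta * (1 - theta)
# when Y is binomial(n, theta). The values of n and theta are illustrative.
n <- 10
theta <- 0.3
y <- 0:n
pmf <- dbinom(y, size = n, prob = theta)

delta_mle  <- (y / n) * (1 - y / n)       # mle of delta at each value of Y
delta_mvue <- (n / (n - 1)) * delta_mle   # mle rescaled by n / (n - 1)

e_mle  <- sum(delta_mle * pmf)            # should be ((n - 1) / n) * delta
e_mvue <- sum(delta_mvue * pmf)           # should be delta
delta_true <- theta * (1 - theta)
print(c(mle = e_mle, mvue = e_mvue, target = delta_true))
```

The mle is biased downward by the factor $(n - 1)/n$, while the rescaled statistic has expectation $\delta$, in agreement with the derivation above.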


In the next example, we consider the uniform$(0, \theta)$ distribution and obtain the
MVUE for all differentiable functions of $\theta$. This example was sent to us by Professor
Bradford Crain of Portland State University.


Example 7.6.2. Suppose $X_1, X_2, \ldots, X_n$ are iid random variables with the common
uniform$(0, \theta)$ distribution. Let $Y_n = \max\{X_1, X_2, \ldots, X_n\}$. In Example 7.4.2,
we showed that $Y_n$ is a complete and sufficient statistic of $\theta$ and the pdf of $Y_n$ is
given by (7.4.1). Let $g(\theta)$ be any differentiable function of $\theta$. Then the MVUE of
$g(\theta)$ is the statistic $u(Y_n)$, which satisfies the equation

$$g(\theta) = \int_0^\theta u(y)\,\frac{n y^{n-1}}{\theta^n}\,dy, \quad \theta > 0,$$
or equivalently,
$$g(\theta)\theta^n = \int_0^\theta u(y)\,n y^{n-1}\,dy, \quad \theta > 0.$$
Differentiating both sides of this equation with respect to $\theta$, we obtain

$$n\theta^{n-1}g(\theta) + \theta^n g'(\theta) = u(\theta)\,n\theta^{n-1}.$$
Solving for $u(\theta)$, we obtain

$$u(\theta) = g(\theta) + \frac{\theta g'(\theta)}{n}.$$

Therefore, the MVUE of $g(\theta)$ is

$$u(Y_n) = g(Y_n) + \frac{Y_n}{n}\,g'(Y_n). \qquad (7.6.2)$$
For example, if $g(\theta) = \theta$, then
$$u(Y_n) = Y_n + \frac{Y_n}{n} = \frac{n + 1}{n}\,Y_n,$$

which agrees with the result obtained in Example 7.4.2. Other examples are given
in Exercise 7.6.5. ■
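Formula (7.6.2) can be checked numerically for a particular choice of $g$. The sketch below takes $g(\theta) = \theta^2$, a choice of ours made only for illustration, and integrates $u(y)$ against the pdf $n y^{n-1}/\theta^n$ of $Y_n$ used in the derivation above; the values of n and theta are likewise arbitrary.

```r
# Numerical check of (7.6.2) for g(theta) = theta^2; n and theta are illustrative.
n <- 5
theta <- 2
g      <- function(t) t^2
gprime <- function(t) 2 * t
u      <- function(y) g(y) + (y / n) * gprime(y)   # candidate MVUE from (7.6.2)
pdf_max <- function(y) n * y^(n - 1) / theta^n     # pdf of Y_n on (0, theta)

# E[u(Y_n)] should equal g(theta) = 4.
e_u <- integrate(function(y) u(y) * pdf_max(y), lower = 0, upper = theta)$value
print(c(expected = e_u, target = g(theta)))
```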


A somewhat different but also very important problem in point estimation is
considered in the next example. In the example the distribution of a random variable
$X$ is described by a pdf $f(x;\theta)$ that depends upon $\theta \in \Omega$. The problem is to estimate
the fractional part of the probability for this distribution, which is at, or to the left
of, a fixed point $c$. Thus we seek an MVUE of $F(c;\theta)$, where $F(x;\theta)$ is the cdf of
$X$.


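Example 7.6.3. Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n > 1$ from a
distribution that is $N(\theta, 1)$. Suppose that we wish to find an MVUE of the function
of $\theta$ defined by

$$P(X \le c) = \int_{-\infty}^{c} \frac{1}{\sqrt{2\pi}}\,e^{-(x - \theta)^2/2}\,dx = \Phi(c - \theta),$$

where $c$ is a fixed constant. There are many unbiased estimators of $\Phi(c - \theta)$. We first
exhibit one of these, say $u(X_1)$, a function of $X_1$ alone. We shall then compute the
conditional expectation, $E[u(X_1) \mid \bar{X} = \bar{x}] = \varphi(\bar{x})$, of this unbiased statistic, given
the sufficient statistic $\bar{X}$, the mean of the sample. In accordance with the theorems
of Rao-Blackwell and Lehmann-Scheffé, $\varphi(\bar{X})$ is the unique MVUE of $\Phi(c - \theta)$.
Consider the function $u(x_1)$, where

$$u(x_1) = \begin{cases} 1 & x_1 \le c \\ 0 & x_1 > c. \end{cases}$$
The expected value of the random variable $u(X_1)$ is given by
$$E[u(X_1)] = 1 \cdot P[X_1 - \theta \le c - \theta] = \Phi(c - \theta).$$

That is, $u(X_1)$ is an unbiased estimator of $\Phi(c - \theta)$.

We shall next discuss the joint distribution of $X_1$ and $\bar{X}$ and the conditional
distribution of $X_1$, given $\bar{X} = \bar{x}$. This conditional distribution enables us to compute
the conditional expectation $E[u(X_1) \mid \bar{X} = \bar{x}] = \varphi(\bar{x})$. In accordance with Exercise
7.6.8, the joint distribution of $X_1$ and $\bar{X}$ is bivariate normal with mean vector $(\theta, \theta)$,
variances $\sigma_1^2 = 1$ and $\sigma_2^2 = 1/n$, and correlation coefficient $\rho = 1/\sqrt{n}$. Thus the
conditional pdf of $X_1$, given $\bar{X} = \bar{x}$, is normal with linear conditional mean

$$\theta + \frac{\rho\sigma_1}{\sigma_2}(\bar{x} - \theta) = \bar{x}$$
and with variance
$$\sigma_1^2(1 - \rho^2) = \frac{n - 1}{n}.$$

The conditional expectation of $u(X_1)$, given $\bar{X} = \bar{x}$, is then

$$\varphi(\bar{x}) = \int_{-\infty}^{\infty} u(x_1)\sqrt{\frac{n}{n - 1}}\,\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{n(x_1 - \bar{x})^2}{2(n - 1)}\right]dx_1 = \int_{-\infty}^{c} \sqrt{\frac{n}{n - 1}}\,\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{n(x_1 - \bar{x})^2}{2(n - 1)}\right]dx_1.$$

The change of variable $z = \sqrt{n}(x_1 - \bar{x})/\sqrt{n - 1}$ enables us to write this conditional
expectation as

$$\varphi(\bar{x}) = \int_{-\infty}^{c'} \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}\,dz = \Phi(c') = \Phi\left[\frac{\sqrt{n}(c - \bar{x})}{\sqrt{n - 1}}\right],$$
where $c' = \sqrt{n}(c - \bar{x})/\sqrt{n - 1}$. Thus the unique MVUE of $\Phi(c - \theta)$ is, for every
fixed constant $c$, given by $\varphi(\bar{X}) = \Phi[\sqrt{n}(c - \bar{X})/\sqrt{n - 1}]$.
In this example the mle of $\Phi(c - \theta)$ is $\Phi(c - \bar{X})$. These two estimators are close
because $\sqrt{n/(n - 1)} \to 1$, as $n \to \infty$. ■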
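The two estimators of Example 7.6.3 are easy to compute in R with pnorm. The sketch below also checks unbiasedness numerically by integrating each estimator against the $N(\theta, 1/n)$ distribution of $\bar{X}$. The function names mvue_cdf and mle_cdf, and the values of n, theta, and c0, are hypothetical choices made for illustration.

```r
# MVUE and mle of P(X <= c) = pnorm(c - theta) for N(theta, 1) data.
# Function names and the values of n, theta, and c0 are illustrative.
mvue_cdf <- function(xbar, n, c0) pnorm(sqrt(n / (n - 1)) * (c0 - xbar))
mle_cdf  <- function(xbar, c0) pnorm(c0 - xbar)

n <- 10
theta <- 0.5
c0 <- 1
s <- 1 / sqrt(n)   # standard deviation of the sample mean

# Expected value of each estimator under xbar ~ N(theta, 1/n); the range
# theta +/- 10 sd carries essentially all of the probability.
e_mvue <- integrate(function(m) mvue_cdf(m, n, c0) * dnorm(m, theta, s),
                    lower = theta - 10 * s, upper = theta + 10 * s)$value
e_mle  <- integrate(function(m) mle_cdf(m, c0) * dnorm(m, theta, s),
                    lower = theta - 10 * s, upper = theta + 10 * s)$value
print(c(mvue = e_mvue, mle = e_mle, target = pnorm(c0 - theta)))
```

Under these settings the expectation of the MVUE matches $\Phi(c - \theta)$ up to integration error, while the mle shows a small bias that shrinks as $n$ grows.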


Remark 7.6.1. We should like to draw the attention of the reader to a rather
important fact. This has to do with the adoption of a principle, such as the principle
of unbiasedness and minimum variance. A principle is not a theorem; and seldom
does a principle yield satisfactory results in all cases. So far, this principle has
provided quite satisfactory results. To see that this is not always the case, let $X$
have a Poisson distribution with parameter $\theta$, $0 < \theta < \infty$. We may look upon $X$ as
a random sample of size 1 from this distribution. Thus $X$ is a complete sufficient
statistic for $\theta$. We seek the estimator of $e^{-2\theta}$ that is unbiased and has minimum
variance. Consider $Y = (-1)^X$. We have

$$E(Y) = E[(-1)^X] = \sum_{x=0}^{\infty} \frac{(-\theta)^x e^{-\theta}}{x!} = e^{-2\theta}.$$

Accordingly, $(-1)^X$ is the MVUE of $e^{-2\theta}$. Here this estimator leaves much to be
desired. We are endeavoring to elicit some information about the number $e^{-2\theta}$,
where $0 < e^{-2\theta} < 1$; yet our point estimate is either $-1$ or $+1$, each of which is a
very poor estimate of a number between 0 and 1. We do not wish to leave the reader
with the impression that an MVUE is bad. That is not the case at all. We merely
wish to point out that if one tries hard enough, one can find instances where such
a statistic is not good. Incidentally, the maximum likelihood estimator of $e^{-2\theta}$ is,
in the case where the sample size equals 1, $e^{-2X}$, which is probably a much better
estimator in practice than is the unbiased estimator $(-1)^X$. ■
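The expectation in Remark 7.6.1 can be confirmed numerically by summing $(-1)^x$ against the Poisson pmf. This is a minimal sketch; the value of theta and the truncation point 200 are arbitrary choices of ours, the latter far enough out that the neglected tail is negligible.

```r
# E[(-1)^X] for X ~ Poisson(theta), with the series truncated at x = 200.
theta <- 1.5
x <- 0:200
e_y <- sum((-1)^x * dpois(x, lambda = theta))
print(c(expected = e_y, target = exp(-2 * theta)))
```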
7.6.1 Bootstrap Standard Errors


Section 6.3 presented the asymptotic theory of maximum likelihood estimators
(mles). In many cases, this theory also provides consistent estimators of the asymptotic
standard deviation of mles. This allows a simple, but very useful, summary of
the estimation process; i.e., $\hat{\theta} \pm SE(\hat{\theta})$ where $\hat{\theta}$ is the mle of $\theta$ and $SE(\hat{\theta})$ is the
corresponding standard error. For example, these summaries can be used descriptively as
labels on plots and tables as well as in the formation of asymptotic confidence intervals
for inference. Section 4.9 presented percentile confidence intervals for $\theta$ based
on the bootstrap. The bootstrap, though, can also be used to obtain standard errors
for estimates including MVUEs.

Consider a random variable $X$ with pdf $f(x;\theta)$, where $\theta \in \Omega$. Let $X_1, \ldots, X_n$
be a random sample on $X$. Let $\hat{\theta}$ be an estimator of $\theta$ based on the sample.
Suppose $x_1, \ldots, x_n$ is a realization of the sample and let $\hat{\theta} = \hat{\theta}(x_1, \ldots, x_n)$ be the
corresponding estimate of $\theta$. Recall in Section 4.9 that the bootstrap uses the
empirical cdf $\hat{F}_n$ of the realization. This is the discrete distribution which places
mass $1/n$ at each point $x_i$. The bootstrap procedure samples, with replacement,
from $\hat{F}_n$.

For the bootstrap procedure, we obtain $B$ bootstrap samples. For $i = 1, \ldots, B$,
let the vector $\mathbf{x}_i^* = (x_{i,1}^*, \ldots, x_{i,n}^*)'$ denote the $i$th bootstrap sample. Let $\hat{\theta}_i^* = \hat{\theta}(\mathbf{x}_i^*)$
denote the estimate of $\theta$ based on the $i$th sample. We then have the bootstrap
estimates $\hat{\theta}_1^*, \ldots, \hat{\theta}_B^*$, which we used in Section 4.9 to obtain the bootstrap percentile
confidence interval for $\theta$. Suppose instead we consider the standard deviation of
these bootstrap estimates; that is,

$$SE_B = \left[\frac{1}{B - 1}\sum_{i=1}^{B}\left(\hat{\theta}_i^* - \bar{\hat{\theta}}^*\right)^2\right]^{1/2}, \qquad (7.6.3)$$

where $\bar{\hat{\theta}}^* = (1/B)\sum_{i=1}^{B}\hat{\theta}_i^*$. This is the bootstrap estimate of the standard error of
$\hat{\theta}$.


Example 7.6.4. For this example, we consider a data set drawn from a normal
distribution, $N(\theta, \sigma^2)$. In this case the MVUE of $\theta$ is the sample mean $\bar{X}$ and
its usual standard error is $s/\sqrt{n}$, where $s$ is the sample standard deviation. The
rounded data¹ are:

27.5 50.9 71.1 43.1 40.4 44.8 36.6 53.5 65.2 47.7

75.7 55.4 61.1 39.8 33.4 57.6 47.9 60.7 27.8 65.2
Assuming the data are in the R vector x, the mean and standard error are computed
as

mean(x); 50.27; sd(x)/sqrt(n); 3.094461
The R function bootse1.R runs the bootstrap for standard errors as described
above. Using 3,000 bootstraps, our run of this function estimated the standard error
by 3.050878. Thus, the estimate and the bootstrap standard error are summarized
as 50.27 ± 3.05.
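The computation in (7.6.3) can be sketched in a few lines of R. The function boot_se below is our own illustrative helper, not the function bootse1.R of the text, whose source is not reproduced here. The seed is arbitrary, so the result will differ slightly from the value 3.050878 reported above.

```r
# Nonparametric bootstrap standard error, following (7.6.3).
# boot_se is a hypothetical helper, not the text's bootse1.R.
boot_se <- function(x, estimator, B = 3000) {
  n <- length(x)
  est <- numeric(B)
  for (i in seq_len(B)) {
    xstar <- sample(x, size = n, replace = TRUE)  # draw from the empirical cdf
    est[i] <- estimator(xstar)                    # ith bootstrap estimate
  }
  sd(est)                                         # sd() uses the divisor B - 1
}

x <- c(27.5, 50.9, 71.1, 43.1, 40.4, 44.8, 36.6, 53.5, 65.2, 47.7,
       75.7, 55.4, 61.1, 39.8, 33.4, 57.6, 47.9, 60.7, 27.8, 65.2)
set.seed(20240101)                                # arbitrary seed
se_formula <- sd(x) / sqrt(length(x))
se_boot <- boot_se(x, mean, B = 3000)
print(c(estimate = mean(x), se_formula = se_formula, se_boot = se_boot))
```

Passing a different function as the estimator argument, for example median, gives the bootstrap standard error of that statistic with no other change to the code.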


¹ The data are in the file sect76data.rda. The true mean and sd are 50 and 15.

The bootstrap process described above is often called the nonparametric bootstrap
because it makes no assumptions about the pdf $f(x;\theta)$. In this chapter,
though, strong assumptions are made about the model. For instance, in the last
example, we assume that the pdf is normal. What if we make use of this information
in the bootstrap? This is called the parametric bootstrap. For the last example,
instead of sampling from the empirical cdf $\hat{F}_n$, we sample randomly from the normal
distribution, using as mean $\bar{x}$ and as standard deviation $s$, the sample standard
deviation. The R function bootse2.R performs this parametric bootstrap. For our
run on the data set in the example, it computed the standard error as 3.162918.
Notice how close the three estimated standard deviations are.

Which bootstrap, nonparametric or parametric, should we use? We recommend
the nonparametric bootstrap in general. The strong model assumptions are not
needed for its validity. See pages 55-56 of Efron and Tibshirani (1993) for discussion.
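A parametric bootstrap for the sample mean under the normal model can be sketched as follows. The function param_boot_se is a hypothetical helper written for illustration and is not the function bootse2.R of the text; the seed is arbitrary, so the value obtained will not reproduce 3.162918 exactly.

```r
# Parametric bootstrap standard error of the sample mean under a normal model.
# param_boot_se is a hypothetical helper, not the text's bootse2.R.
param_boot_se <- function(x, B = 3000) {
  n <- length(x)
  xbar <- mean(x)   # fitted mean of the normal model
  s <- sd(x)        # fitted standard deviation of the normal model
  est <- replicate(B, mean(rnorm(n, mean = xbar, sd = s)))
  sd(est)
}

x <- c(27.5, 50.9, 71.1, 43.1, 40.4, 44.8, 36.6, 53.5, 65.2, 47.7,
       75.7, 55.4, 61.1, 39.8, 33.4, 57.6, 47.9, 60.7, 27.8, 65.2)
set.seed(20240102)  # arbitrary seed
se_param <- param_boot_se(x, B = 3000)
print(se_param)
```

The only change from the nonparametric version is the sampling step: the bootstrap samples come from the fitted normal distribution rather than from the empirical cdf.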


EXERCISES 


7.6.1. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution that is
$N(\theta, 1)$, $-\infty < \theta < \infty$. Find the MVUE of $\theta^2$.

Hint: First determine $E(\bar{X}^2)$.


7.6.2. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution that is
$N(0, \theta)$. Then $Y = \sum X_i^2$ is a complete sufficient statistic for $\theta$. Find the MVUE
of $\theta^2$.


7.6.3. Consider Example 7.6.3 where the parameter of interest is $P(X \le c)$ for $X$
distributed $N(\theta, 1)$. Modify the R function bootse1.R so that for a specified value
of $c$ it returns the MVUE of $P(X \le c)$ and the bootstrap standard error of the
estimate. Run your function on the data in ex763data.rda with $c = 11$ and 3,000
bootstraps. These data are generated from a $N(10, 1)$ distribution. Report (a) the
true parameter, (b) the MVUE, and (c) the bootstrap standard error.


7.6.4. For Example 7.6.4, modify the R function bootse1.R so that the estimate 
is the median not the mean. Using 3,000 bootstraps, run your function on the data 
set discussed in the example and report (a) the estimate and (b) the bootstrap 
standard error. 


7.6.5. Let $X_1, X_2, \ldots, X_n$ be a random sample from a uniform$(0, \theta)$ distribution.
Continuing with Example 7.6.2, find the MVUEs for the following functions of $\theta$.

(a) $g(\theta) = \theta^2/12$, i.e., the variance of the distribution.
(b) $g(\theta) = 1/\theta$, i.e., the pdf of the distribution.
(c) For $t$ real, $g(\theta) = (e^{t\theta} - 1)/(t\theta)$, i.e., the mgf of the distribution.


7.6.6. Let $X_1, X_2, \ldots, X_n$ be a random sample from a Poisson distribution with
parameter $\theta > 0$.

(a) Find the MVUE of $P(X \le 1) = (1 + \theta)e^{-\theta}$.
Hint: Let $u(x_1) = 1$, $x_1 \le 1$, zero elsewhere, and find $E[u(X_1) \mid Y = y]$,
where $Y = \sum_{i=1}^n X_i$.

(b) Express the MVUE as a function of the mle of $\theta$.
(c) Determine the asymptotic distribution of the mle of $\theta$.

(d) Obtain the mle of $P(X \le 1)$. Then use Theorem 5.2.9 to determine its
asymptotic distribution.


7.6.7. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a Poisson distribution
with parameter $\theta > 0$. From Remark 7.6.1, we know that $E[(-1)^{X_1}] = e^{-2\theta}$.

(a) Show that $E[(-1)^{X_1} \mid Y_1 = y_1] = (1 - 2/n)^{y_1}$, where $Y_1 = X_1 + X_2 + \cdots + X_n$.
Hint: First show that the conditional pdf of $X_1, X_2, \ldots, X_{n-1}$, given $Y_1 = y_1$,
is multinomial, and hence that of $X_1$, given $Y_1 = y_1$, is $b(y_1, 1/n)$.

(b) Show that the mle of $e^{-2\theta}$ is $e^{-2\bar{X}}$.

(c) Since $y_1 = n\bar{x}$, show that $(1 - 2/n)^{y_1}$ is approximately equal to $e^{-2\bar{x}}$ when $n$
is large.


7.6.8. As in Example 7.6.3, let $X_1, X_2, \ldots, X_n$ be a random sample of size $n > 1$
from a distribution that is $N(\theta, 1)$. Show that the joint distribution of $X_1$ and $\bar{X}$
is bivariate normal with mean vector $(\theta, \theta)$, variances $\sigma_1^2 = 1$ and $\sigma_2^2 = 1/n$, and
correlation coefficient $\rho = 1/\sqrt{n}$.


7.6.9. Let a random sample of size $n$ be taken from a distribution that has the pdf
$f(x;\theta) = (1/\theta)\exp(-x/\theta)I_{(0,\infty)}(x)$. Find the mle and MVUE of $P(X \le 2)$.


7.6.10. Let $X_1, X_2, \ldots, X_n$ be a random sample with the common pdf $f(x) =
\theta^{-1}e^{-x/\theta}$ for $x > 0$, zero elsewhere; that is, $f(x)$ is a $\Gamma(1, \theta)$ pdf.

(a) Show that the statistic $\bar{X} = n^{-1}\sum_{i=1}^n X_i$ is a complete and sufficient statistic
for $\theta$.

(b) Determine the MVUE of $\theta$.
(c) Determine the mle of $\theta$.

(d) Often, though, this pdf is written as $f(x) = \tau e^{-\tau x}$, for $x > 0$, zero elsewhere.
Thus $\tau = 1/\theta$. Use Theorem 6.1.2 to determine the mle of $\tau$.

(e) Show that the statistic $\bar{X} = n^{-1}\sum_{i=1}^n X_i$ is a complete and sufficient statistic
for $\tau$. Show that $(n - 1)/(n\bar{X})$ is the MVUE of $\tau = 1/\theta$. Hence, as usual,
the reciprocal of the mle of $\theta$ is the mle of $1/\theta$, but, in this situation, the
reciprocal of the MVUE of $\theta$ is not the MVUE of $1/\theta$.

(f) Compute the variances of each of the unbiased estimators in parts (b) and
(e).




7.6.11. Consider the situation of the last exercise, but suppose we have the following
two independent random samples: (1) $X_1, X_2, \ldots, X_n$ is a random sample with the
common pdf $f_X(x) = \theta^{-1}e^{-x/\theta}$, for $x > 0$, zero elsewhere, and (2) $Y_1, Y_2, \ldots, Y_n$ is
a random sample with common pdf $f_Y(y) = \theta e^{-\theta y}$, for $y > 0$, zero elsewhere. The
last exercise suggests that, for some constant $c$, $Z = c\bar{X}/\bar{Y}$ might be an unbiased
estimator of $\theta^2$. Find this constant $c$ and the variance of $Z$.

Hint: Show that $\bar{X}/(\theta^2\bar{Y})$ has an $F$-distribution.


7.6.12. Obtain the asymptotic distribution of the MVUE in Example 7.6.1 for the
case $\theta = 1/2$.


7.7 The Case of Several Parameters 


In many of the interesting problems we encounter, the pdf or pmf may not depend
upon a single parameter $\theta$, but perhaps upon two (or more) parameters. In general,
our parameter space $\Omega$ is a subset of $R^p$, but in many of our examples $p$ is 2.


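Definition 7.7.1. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution
that has pdf or pmf $f(x;\boldsymbol{\theta})$, where $\boldsymbol{\theta} \in \Omega \subset R^p$. Let $\mathcal{S}$ denote the support of $X$.
Let $\mathbf{Y}$ be an $m$-dimensional random vector of statistics $\mathbf{Y} = (Y_1, \ldots, Y_m)'$, where
$Y_i = u_i(X_1, X_2, \ldots, X_n)$, for $i = 1, \ldots, m$. Denote the pdf or pmf of $\mathbf{Y}$ by $f_{\mathbf{Y}}(\mathbf{y};\boldsymbol{\theta})$
for $\mathbf{y} \in R^m$. The random vector of statistics $\mathbf{Y}$ is jointly sufficient for $\boldsymbol{\theta}$ if and
only if
$$\frac{\prod_{i=1}^n f(x_i;\boldsymbol{\theta})}{f_{\mathbf{Y}}(\mathbf{y};\boldsymbol{\theta})} = H(x_1, x_2, \ldots, x_n), \quad \text{for all } x_i \in \mathcal{S},$$
where $H(x_1, x_2, \ldots, x_n)$ does not depend upon $\boldsymbol{\theta}$.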


In general, $m \neq p$, i.e., the number of sufficient statistics does not have to be
the same as the number of parameters, but in most of our examples this is the case.

As may be anticipated, the factorization theorem can be extended. In our notation
it can be stated in the following manner. The vector of statistics $\mathbf{Y}$ is jointly
sufficient for the parameter $\boldsymbol{\theta} \in \Omega$ if and only if we can find two nonnegative
functions $k_1$ and $k_2$ such that

$$\prod_{i=1}^n f(x_i;\boldsymbol{\theta}) = k_1(\mathbf{y};\boldsymbol{\theta})\,k_2(x_1, \ldots, x_n), \quad \text{for all } x_i \in \mathcal{S}, \qquad (7.7.1)$$
where the function $k_2(x_1, x_2, \ldots, x_n)$ does not depend upon $\boldsymbol{\theta}$.


Example 7.7.1. Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution
having pdf

$$f(x;\theta_1,\theta_2) = \begin{cases} \dfrac{1}{2\theta_2} & \theta_1 - \theta_2 < x < \theta_1 + \theta_2 \\ 0 & \text{elsewhere,} \end{cases}$$

where $-\infty < \theta_1 < \infty$, $0 < \theta_2 < \infty$. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics.
The joint pdf of $Y_1$ and $Y_n$ is given by

$$f_{Y_1,Y_n}(y_1, y_n;\theta_1,\theta_2) = \frac{n(n - 1)}{(2\theta_2)^n}(y_n - y_1)^{n-2}, \quad \theta_1 - \theta_2 < y_1 < y_n < \theta_1 + \theta_2,$$
and equals zero elsewhere. Accordingly, the joint pdf of $X_1, X_2, \ldots, X_n$ can be
written, for all points in its support (all $x_i$ such that $\theta_1 - \theta_2 < x_i < \theta_1 + \theta_2$),

$$\left(\frac{1}{2\theta_2}\right)^n = \frac{n(n - 1)[\max(x_i) - \min(x_i)]^{n-2}}{(2\theta_2)^n}\left(\frac{1}{n(n - 1)[\max(x_i) - \min(x_i)]^{n-2}}\right).$$

Since $\min(x_i) \le x_j \le \max(x_i)$, $j = 1, 2, \ldots, n$, the last factor does not depend
upon the parameters. Either the definition or the factorization theorem assures us
that $Y_1$ and $Y_n$ are joint sufficient statistics for $\theta_1$ and $\theta_2$. ■
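The defining ratio in Example 7.7.1 can be evaluated numerically at two different parameter values to see that it is free of $\theta_1$ and $\theta_2$. In the sketch below, the sample x, the parameter values, and the function names joint_x and joint_minmax are hypothetical choices made for illustration; both parameter settings are chosen so that the sample lies inside the support.

```r
# Ratio of the joint pdf of the sample to the joint pdf of (Y_1, Y_n) for the
# uniform(theta1 - theta2, theta1 + theta2) model. All values are illustrative.
joint_x <- function(x, t1, t2) prod(dunif(x, min = t1 - t2, max = t1 + t2))
joint_minmax <- function(x, t1, t2) {
  n <- length(x)
  inside <- (min(x) > t1 - t2) && (max(x) < t1 + t2)
  if (!inside) return(0)                       # outside the support
  n * (n - 1) * (max(x) - min(x))^(n - 2) / (2 * t2)^n
}

x <- c(0.2, -0.4, 0.9, 0.1, 0.5)
n <- length(x)
h1 <- joint_x(x, 0, 1) / joint_minmax(x, 0, 1)        # theta = (0, 1)
h2 <- joint_x(x, 0.3, 2) / joint_minmax(x, 0.3, 2)    # theta = (0.3, 2)
h_formula <- 1 / (n * (n - 1) * (max(x) - min(x))^(n - 2))
print(c(h1 = h1, h2 = h2, formula = h_formula))
```

All three values agree, which is the content of the statement that the last factor does not depend upon the parameters.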


The concept of a complete family of probability density functions is generalized
as follows: Let
$$\{f(v_1, v_2, \ldots, v_k;\boldsymbol{\theta}) : \boldsymbol{\theta} \in \Omega\}$$

denote a family of pdfs of $k$ random variables $V_1, V_2, \ldots, V_k$ that depends upon the
$p$-dimensional vector of parameters $\boldsymbol{\theta} \in \Omega$. Let $u(v_1, v_2, \ldots, v_k)$ be a function of
$v_1, v_2, \ldots, v_k$ (but not a function of any or all of the parameters). If

$$E[u(V_1, V_2, \ldots, V_k)] = 0$$

for all $\boldsymbol{\theta} \in \Omega$ implies that $u(v_1, v_2, \ldots, v_k) = 0$ at all points $(v_1, v_2, \ldots, v_k)$, except on
a set of points that has probability zero for all members of the family of probability
density functions, we shall say that the family of probability density functions is a
complete family.

In the case where $\boldsymbol{\theta}$ is a vector, we generally consider best estimators of functions
of $\boldsymbol{\theta}$, that is, parameters $\delta$, where $\delta = g(\boldsymbol{\theta})$ for a specified function $g$. For example,
suppose we are sampling from a $N(\theta_1, \theta_2)$ distribution, where $\theta_2$ is the variance. Let
$\boldsymbol{\theta} = (\theta_1, \theta_2)'$ and consider the two parameters $\delta_1 = g_1(\boldsymbol{\theta}) = \theta_1$ and $\delta_2 = g_2(\boldsymbol{\theta}) =
\sqrt{\theta_2}$. Hence we are interested in best estimates of $\delta_1$ and $\delta_2$.

The Rao-Blackwell, Lehmann-Scheffé theory outlined in Sections 7.3 and 7.4
extends naturally to this vector case. Briefly, suppose $\delta = g(\boldsymbol{\theta})$ is the parameter
of interest and $\mathbf{Y}$ is a vector of sufficient and complete statistics for $\boldsymbol{\theta}$. Let $T$ be
a statistic that is a function of $\mathbf{Y}$, such as $T = T(\mathbf{Y})$. If $E(T) = \delta$, then $T$ is the
unique MVUE of $\delta$.

The remainder of our treatment of the case of several parameters is restricted to 
probability density functions that represent what we shall call regular cases of the 
exponential class. Here m = p. 


Definition 7.7.2. Let $X$ be a random variable with pdf or pmf $f(x;\boldsymbol{\theta})$, where
the vector of parameters $\boldsymbol{\theta} \in \Omega \subset R^m$. Let $\mathcal{S}$ denote the support of $X$. If $X$ is
continuous, assume that $\mathcal{S} = (a, b)$, where $a$ or $b$ may be $-\infty$ or $\infty$, respectively. If
$X$ is discrete, assume that $\mathcal{S} = \{a_1, a_2, \ldots\}$. Suppose $f(x;\boldsymbol{\theta})$ is of the form

$$f(x;\boldsymbol{\theta}) = \begin{cases} \exp\left[\sum_{j=1}^m p_j(\boldsymbol{\theta})K_j(x) + H(x) + q(\theta_1, \theta_2, \ldots, \theta_m)\right] & \text{for all } x \in \mathcal{S} \\ 0 & \text{elsewhere.} \end{cases} \qquad (7.7.2)$$
Then we say this pdf or pmf is a member of the exponential class. We say it is
a regular case of the exponential family if, in addition,
1. the support does not depend on the vector of parameters $\boldsymbol{\theta}$,
2. the space $\Omega$ contains a nonempty, $m$-dimensional open rectangle,

3. the $p_j(\boldsymbol{\theta})$, $j = 1, \ldots, m$, are nontrivial, functionally independent, continuous
functions of $\boldsymbol{\theta}$,

4. and, depending on whether $X$ is continuous or discrete, one of the following
holds, respectively:

(a) if $X$ is a continuous random variable, then the $m$ derivatives $K_j'(x)$, for
$j = 1, 2, \ldots, m$, are continuous for $a < x < b$ and no one is a linear
homogeneous function of the others, and $H(x)$ is a continuous function
of $x$, $a < x < b$.

(b) if $X$ is discrete, the $K_j(x)$, $j = 1, 2, \ldots, m$, are nontrivial functions of
$x$ on the support $\mathcal{S}$ and no one is a linear homogeneous function of the
others.


Let $X_1, \ldots, X_n$ be a random sample on $X$ where the pdf or pmf of $X$ is a regular
case of the exponential class with the same notation as in Definition 7.7.2. It follows
from (7.7.2) that the joint pdf or pmf of the sample is given by

$$\prod_{i=1}^n f(x_i;\boldsymbol{\theta}) = \exp\left[\sum_{j=1}^m p_j(\boldsymbol{\theta})\sum_{i=1}^n K_j(x_i) + nq(\boldsymbol{\theta})\right]\exp\left[\sum_{i=1}^n H(x_i)\right] \qquad (7.7.3)$$

for all $x_i \in \mathcal{S}$. In accordance with the factorization theorem, the statistics
$$Y_1 = \sum_{i=1}^n K_1(X_i), \quad Y_2 = \sum_{i=1}^n K_2(X_i), \quad \ldots, \quad Y_m = \sum_{i=1}^n K_m(X_i)$$

are joint sufficient statistics for the $m$-dimensional vector of parameters $\boldsymbol{\theta}$. It is left
as an exercise to prove that the joint pdf of $\mathbf{Y} = (Y_1, \ldots, Y_m)'$ is of the form

$$R(\mathbf{y})\exp\left[\sum_{j=1}^m p_j(\boldsymbol{\theta})y_j + nq(\boldsymbol{\theta})\right] \qquad (7.7.4)$$

at points of positive probability density. These points of positive probability density
and the function $R(\mathbf{y})$ do not depend upon the vector of parameters $\boldsymbol{\theta}$. Moreover,
in accordance with a theorem in analysis, it can be asserted that in a regular case
of the exponential class, the family of probability density functions of these joint
sufficient statistics $Y_1, Y_2, \ldots, Y_m$ is complete when $n > m$. In accordance with a
convention previously adopted, we shall refer to $Y_1, Y_2, \ldots, Y_m$ as joint complete
sufficient statistics for the vector of parameters $\boldsymbol{\theta}$.


Example 7.7.2. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution
that is $N(\theta_1, \theta_2)$, $-\infty < \theta_1 < \infty$, $0 < \theta_2 < \infty$. Thus the pdf $f(x;\theta_1,\theta_2)$ of the
distribution may be written as

$$f(x;\theta_1,\theta_2) = \exp\left(\frac{-1}{2\theta_2}x^2 + \frac{\theta_1}{\theta_2}x - \frac{\theta_1^2}{2\theta_2} - \ln\sqrt{2\pi\theta_2}\right).$$
Therefore, we can take $K_1(x) = x^2$ and $K_2(x) = x$. Consequently, the statistics

$$Y_1 = \sum_{i=1}^n X_i^2 \quad \text{and} \quad Y_2 = \sum_{i=1}^n X_i$$

are joint complete sufficient statistics for $\theta_1$ and $\theta_2$. Since the relations

$$Z_1 = \frac{Y_2}{n} = \bar{X}, \qquad Z_2 = \frac{Y_1 - Y_2^2/n}{n - 1} = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n - 1}$$
define a one-to-one transformation, $Z_1$ and $Z_2$ are also joint complete sufficient
statistics for $\theta_1$ and $\theta_2$. Moreover,

$$E(Z_1) = \theta_1 \quad \text{and} \quad E(Z_2) = \theta_2.$$

From completeness, we have that $Z_1$ and $Z_2$ are the only functions of $Y_1$ and $Y_2$
that are unbiased estimators of $\theta_1$ and $\theta_2$, respectively. Hence $Z_1$ and $Z_2$ are the
unique minimum variance estimators of $\theta_1$ and $\theta_2$, respectively. The MVUE of the
standard deviation $\sqrt{\theta_2}$ is derived in Exercise 7.7.5. ■
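The one-to-one relation between $(Y_1, Y_2)$ and $(Z_1, Z_2)$ in Example 7.7.2 is easy to confirm in R, where $Z_1$ and $Z_2$ should agree with mean() and var(). The data vector below is made up for illustration only.

```r
# Z_1 and Z_2 computed from the sufficient statistics Y_1 and Y_2.
# The data vector is an arbitrary illustration.
x <- c(2.1, 3.4, 1.9, 5.0, 4.2, 3.3)
n <- length(x)

y1 <- sum(x^2)                  # Y_1 = sum of X_i^2
y2 <- sum(x)                    # Y_2 = sum of X_i
z1 <- y2 / n                    # Z_1, the sample mean
z2 <- (y1 - y2^2 / n) / (n - 1) # Z_2, the sample variance
print(c(z1 = z1, mean = mean(x), z2 = z2, var = var(x)))
```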


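In this section we have extended the concepts of sufficiency and completeness
to the case where $\boldsymbol{\theta}$ is a $p$-dimensional vector. We now extend these concepts to
the case where $\mathbf{X}$ is a $k$-dimensional random vector. We only consider the regular
exponential class.

Suppose $\mathbf{X}$ is a $k$-dimensional random vector with pdf or pmf $f(\mathbf{x};\boldsymbol{\theta})$, where
$\boldsymbol{\theta} \in \Omega \subset R^p$. Let $\mathcal{S} \subset R^k$ denote the support of $\mathbf{X}$. Suppose $f(\mathbf{x};\boldsymbol{\theta})$ is of the form

$$f(\mathbf{x};\boldsymbol{\theta}) = \begin{cases} \exp\left[\sum_{j=1}^m p_j(\boldsymbol{\theta})K_j(\mathbf{x}) + H(\mathbf{x}) + q(\boldsymbol{\theta})\right] & \text{for all } \mathbf{x} \in \mathcal{S} \\ 0 & \text{elsewhere.} \end{cases} \qquad (7.7.5)$$

Then we say this pdf or pmf is a member of the exponential class. If, in addition,
$p = m$, the support does not depend on the vector of parameters $\boldsymbol{\theta}$, and conditions
similar to those of Definition 7.7.2 hold, then we say this pdf is a regular case of
the exponential class.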


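Suppose that $\mathbf{X}_1, \ldots, \mathbf{X}_n$ constitute a random sample on $\mathbf{X}$. Then the statistics,
$$Y_j = \sum_{i=1}^n K_j(\mathbf{X}_i), \quad \text{for } j = 1, \ldots, m, \qquad (7.7.6)$$

are sufficient and complete statistics for $\boldsymbol{\theta}$. Let $\mathbf{Y} = (Y_1, \ldots, Y_m)'$. Suppose $\delta = g(\boldsymbol{\theta})$
is a parameter of interest. If $T = h(\mathbf{Y})$ for some function $h$ and $E(T) = \delta$ then $T$
is the unique minimum variance unbiased estimator of $\delta$.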


Example 7.7.3 (Multinomial). In Example 6.4.5, we consider the mles of the
multinomial distribution. In this example we determine the MVUEs of several of
the parameters. As in Example 6.4.5, consider a random trial that can result in one,
and only one, of $k$ outcomes or categories. Let $X_j$ be 1 or 0 depending on whether
the $j$th outcome does or does not occur, for $j = 1, \ldots, k$. Suppose the probability
that outcome $j$ occurs is $p_j$; hence, $\sum_{j=1}^k p_j = 1$. Let $\mathbf{X} = (X_1, \ldots, X_{k-1})'$ and
$\mathbf{p} = (p_1, \ldots, p_{k-1})'$. The distribution of $\mathbf{X}$ is multinomial and can be found in
expression (6.4.18), which can be reexpressed as

$$f(\mathbf{x}, \mathbf{p}) = \exp\left\{\sum_{j=1}^{k-1}\left(\log\left[\frac{p_j}{1 - \sum_{i \neq k} p_i}\right]\right)x_j + \log\left(1 - \sum_{i \neq k} p_i\right)\right\}.$$

Because this is a regular case of the exponential family, the following statistics,
resulting from a random sample $\mathbf{X}_1, \ldots, \mathbf{X}_n$ from the distribution of $\mathbf{X}$, are jointly
sufficient and complete for the parameters $\mathbf{p} = (p_1, \ldots, p_{k-1})'$:

$$Y_j = \sum_{i=1}^n X_{ij}, \quad \text{for } j = 1, \ldots, k - 1.$$

Each random variable $X_{ij}$ is Bernoulli with parameter $p_j$ and the variables $X_{ij}$
are independent for $i = 1, \ldots, n$. Hence the variables $Y_j$ are binomial$(n, p_j)$ for
$j = 1, \ldots, k$. Thus the MVUE of $p_j$ is the statistic $n^{-1}Y_j$.

Next, we shall find the MVUE of $p_j p_l$, for $j \neq l$. Exercise 7.7.8 shows that the
mle of $p_j p_l$ is $n^{-2}Y_j Y_l$. Recall from Section 3.1 that the conditional distribution of
$Y_j$, given $Y_l$, is $b[n - Y_l, p_j/(1 - p_l)]$. As an initial guess at the MVUE, consider the
mle, which, as shown by Exercise 7.7.8, is $n^{-2}Y_j Y_l$. Hence

$$E[n^{-2}Y_j Y_l] = \frac{1}{n^2}E[E(Y_j Y_l \mid Y_l)] = \frac{1}{n^2}E[Y_l E(Y_j \mid Y_l)]$$
$$= \frac{1}{n^2}E\left[Y_l(n - Y_l)\frac{p_j}{1 - p_l}\right] = \frac{1}{n^2}\frac{p_j}{1 - p_l}\left\{E[nY_l] - E[Y_l^2]\right\}$$
$$= \frac{1}{n^2}\frac{p_j}{1 - p_l}\left\{n^2 p_l - np_l(1 - p_l) - n^2 p_l^2\right\}$$
$$= \frac{1}{n^2}\frac{p_j}{1 - p_l}\,np_l(n - 1)(1 - p_l) = \frac{n - 1}{n}\,p_j p_l.$$

Hence the MVUE of $p_j p_l$ is $\frac{1}{n(n - 1)}Y_j Y_l$. ■
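The expectation derived in Example 7.7.3 can be verified exactly for a small case by enumerating the multinomial pmf with dmultinom. The sketch below uses $k = 3$ categories; the values of n and p are arbitrary illustrative choices.

```r
# Exact value of E[Y_1 * Y_2] for a multinomial(n, p) vector with k = 3
# categories, obtained by enumerating all outcomes. n and p are illustrative.
n <- 5
p <- c(0.2, 0.3, 0.5)
e_prod <- 0
for (y1 in 0:n) {
  for (y2 in 0:(n - y1)) {
    y3 <- n - y1 - y2
    e_prod <- e_prod + y1 * y2 * dmultinom(c(y1, y2, y3), size = n, prob = p)
  }
}
e_mle  <- e_prod / n^2             # expectation of the mle of p_1 * p_2
e_mvue <- e_prod / (n * (n - 1))   # expectation of Y_1 * Y_2 / (n (n - 1))
print(c(mle = e_mle, mvue = e_mvue, target = p[1] * p[2]))
```

The mle has expectation $((n - 1)/n)p_1 p_2$, and dividing by $n(n - 1)$ instead of $n^2$ removes the bias.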


Example 7.7.4 (Multivariate Normal). Let $\mathbf{X}$ have the multivariate normal
distribution $N_k(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where $\boldsymbol{\Sigma}$ is a positive definite $k \times k$ matrix. The pdf of $\mathbf{X}$ is given
in expression (3.5.16). In this case $\boldsymbol{\theta}$ is a $\{k + [k(k + 1)/2]\}$-dimensional vector whose
first $k$ components consist of the mean vector $\boldsymbol{\mu}$ and whose last $k(k + 1)/2$ components
consist of the componentwise variances $\sigma_i^2$ and the covariances $\sigma_{ij}$, for $j > i$. The
density of $\mathbf{X}$ can be written as

$$f_{\mathbf{X}}(\mathbf{x}) = \exp\left\{-\frac{1}{2}\mathbf{x}'\boldsymbol{\Sigma}^{-1}\mathbf{x} + \boldsymbol{\mu}'\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} - \frac{1}{2}\log|\boldsymbol{\Sigma}| - \frac{k}{2}\log 2\pi\right\}, \qquad (7.7.7)$$
for $\mathbf{x} \in R^k$. Hence, by (7.7.5), the multivariate normal pdf is a regular case of the
exponential class of distributions. We need only identify the functions $K(\mathbf{x})$. The
second term in the exponent on the right side of (7.7.7) can be written as $(\boldsymbol{\mu}'\boldsymbol{\Sigma}^{-1})\mathbf{x}$;
hence, $K_1(\mathbf{x}) = \mathbf{x}$. The first term is easily seen to be a linear combination of the
products $x_i x_j$, $i, j = 1, 2, \ldots, k$, which are the entries of the matrix $\mathbf{x}\mathbf{x}'$. Hence we
can take $K_2(\mathbf{x}) = \mathbf{x}\mathbf{x}'$. Now, let $\mathbf{X}_1, \ldots, \mathbf{X}_n$ be a random sample on $\mathbf{X}$. Based on
(7.7.7) then, a set of sufficient and complete statistics is given by

$$\mathbf{Y}_1 = \sum_{i=1}^n \mathbf{X}_i \quad \text{and} \quad \mathbf{Y}_2 = \sum_{i=1}^n \mathbf{X}_i\mathbf{X}_i'.$$

Note that $\mathbf{Y}_1$ is a vector of $k$ statistics and that $\mathbf{Y}_2$ is a $k \times k$ symmetric matrix.
Because the matrix is symmetric, we can eliminate the bottom-half [elements $(i, j)$
with $i > j$] of the matrix, which results in $\{k + [k(k + 1)/2]\}$ complete sufficient
statistics, i.e., as many complete sufficient statistics as there are parameters.

Based on marginal distributions, it is easy to show that $\bar{X}_j = n^{-1}\sum_{i=1}^n X_{ij}$ is
the MVUE of $\mu_j$ and that $(n - 1)^{-1}\sum_{i=1}^n (X_{ij} - \bar{X}_j)^2$ is the MVUE of $\sigma_j^2$. The
MVUEs of the covariance parameters are obtained in Exercise 7.7.9. ■
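The statistics $\mathbf{Y}_1$ and $\mathbf{Y}_2$ of Example 7.7.4 are simple matrix computations in R, and the sample means and the sample covariance matrix can be recovered from them. The sketch below uses simulated standard normal data purely as a placeholder; the data matrix X (rows are observations), the seed, and the dimensions are illustrative assumptions.

```r
# Sufficient statistics Y_1 and Y_2 for a k-variate normal sample.
# X is a placeholder data matrix with one observation per row.
set.seed(1)                                 # arbitrary seed
n <- 50
k <- 3
X <- matrix(rnorm(n * k), nrow = n, ncol = k)

y1 <- colSums(X)                            # Y_1 = sum of the X_i
y2 <- crossprod(X)                          # Y_2 = sum of X_i X_i' = t(X) %*% X

xbar <- y1 / n                              # sample means from Y_1
S <- (y2 - outer(y1, y1) / n) / (n - 1)     # sample covariance from (Y_1, Y_2)
n_stats <- k + k * (k + 1) / 2              # distinct statistics after symmetry
print(xbar)
print(S)
```

The diagonal of S holds the statistics $(n - 1)^{-1}\sum_{i=1}^n (X_{ij} - \bar{X}_j)^2$ named in the example, and S as a whole matches the output of cov(X).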


For our last example, we consider a case where the set of parameters is the cdf. 


Example 7.7.5. Let $X_1, X_2, \ldots, X_n$ be a random sample having the common
continuous cdf $F(x)$. Let $Y_1 < Y_2 < \cdots < Y_n$ denote the corresponding order statistics.
Note that given $Y_1 = y_1, Y_2 = y_2, \ldots, Y_n = y_n$, the conditional distribution of
$X_1, X_2, \ldots, X_n$ is discrete with probability $\frac{1}{n!}$ on each of the $n!$ permutations of
the vector $(y_1, y_2, \ldots, y_n)$ [because $F(x)$ is continuous, we can assume that each
of the values $y_1, y_2, \ldots, y_n$ is distinct]. That is, the conditional distribution does
not depend on $F(x)$. Hence, by the definition of sufficiency, the order statistics are
sufficient for $F(x)$. Furthermore, while the proof is beyond the scope of this book,
it can be shown that the order statistics are also complete; see page 72 of Lehmann
and Casella (1998).

Let $T = T(x_1, x_2, \ldots, x_n)$ be any statistic that is symmetric in its arguments;
i.e., $T(x_1, x_2, \ldots, x_n) = T(x_{i_1}, x_{i_2}, \ldots, x_{i_n})$ for any permutation $(x_{i_1}, x_{i_2}, \ldots, x_{i_n})$
of $(x_1, x_2, \ldots, x_n)$. Then $T$ is a function of the order statistics. This is useful in
determining MVUEs for this situation; see Exercises 7.7.12 and 7.7.13.


EXERCISES 


7.7.1. Let $Y_1 < Y_2 < Y_3$ be the order statistics of a random sample of size 3 from
the distribution with pdf

$$f(x;\theta_1,\theta_2) = \begin{cases} \dfrac{1}{\theta_2}\exp\left(-\dfrac{x - \theta_1}{\theta_2}\right) & \theta_1 < x < \infty,\ -\infty < \theta_1 < \infty,\ 0 < \theta_2 < \infty \\ 0 & \text{elsewhere.} \end{cases}$$

Find the joint pdf of $Z_1 = Y_1$, $Z_2 = Y_2$, and $Z_3 = Y_1 + Y_2 + Y_3$. The corresponding
transformation maps the space $\{(y_1, y_2, y_3) : \theta_1 < y_1 < y_2 < y_3 < \infty\}$ onto the
space

$$\{(z_1, z_2, z_3) : \theta_1 < z_1 < z_2 < (z_3 - z_1)/2 < \infty\}.$$

Show that $Z_1$ and $Z_3$ are joint sufficient statistics for $\theta_1$ and $\theta_2$.

7.7.2. Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution that has a
pdf of the form (7.7.2) of this section. Show that $Y_1 = \sum_{i=1}^n K_1(X_i), \ldots, Y_m =
\sum_{i=1}^n K_m(X_i)$ have a joint pdf of the form (7.7.4) of this section.


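7.7.3. Let $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ denote a random sample of size $n$ from
a bivariate normal distribution with means $\mu_1$ and $\mu_2$, positive variances $\sigma_1^2$ and
$\sigma_2^2$, and correlation coefficient $\rho$. Show that $\sum_{i=1}^n X_i$, $\sum_{i=1}^n Y_i$, $\sum_{i=1}^n X_i^2$, $\sum_{i=1}^n Y_i^2$, and
$\sum_{i=1}^n X_i Y_i$ are joint complete sufficient statistics for the five parameters. Are $\bar{X} =
\sum_{i=1}^n X_i/n$, $\bar{Y} = \sum_{i=1}^n Y_i/n$, $S_1^2 = \sum_{i=1}^n (X_i - \bar{X})^2/(n - 1)$, $S_2^2 = \sum_{i=1}^n (Y_i - \bar{Y})^2/(n - 1)$,
and $\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})/[(n - 1)S_1 S_2]$ also joint complete sufficient statistics for
these parameters?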


7.7.4. Let the pdf $f(x;\theta_1,\theta_2)$ be of the form

$$\exp[p_1(\theta_1,\theta_2)K_1(x) + p_2(\theta_1,\theta_2)K_2(x) + H(x) + q(\theta_1,\theta_2)], \quad a < x < b,$$

zero elsewhere. Suppose that $K_1'(x) = cK_2'(x)$. Show that $f(x;\theta_1,\theta_2)$ can be written
in the form
$$\exp[p(\theta_1,\theta_2)K_2(x) + H(x) + q(\theta_1,\theta_2)], \quad a < x < b,$$

zero elsewhere. This is the reason why it is required that no one $K_j'(x)$ be a linear
homogeneous function of the others, that is, so that the number of sufficient
statistics equals the number of parameters.


7.7.5. In Example 7.7.2:
(a) Find the MVUE of the standard deviation $\sqrt{\theta_2}$.


(b) Modify the R function bootse1.R so that it returns the estimate in (a) and 
its bootstrap standard error. Run it on the Bavarian forest data discussed 
in Example 4.1.3, where the response is the concentration of sulfur dioxide. 
Using 3,000 bootstraps, report the estimate and its bootstrap standard error. 


7.7.6. Let $X_1, X_2, \ldots, X_n$ be a random sample from the uniform distribution with
pdf $f(x;\theta_1,\theta_2) = 1/(2\theta_2)$, $\theta_1 - \theta_2 < x < \theta_1 + \theta_2$, where $-\infty < \theta_1 < \infty$ and $\theta_2 > 0$,
and the pdf is equal to zero elsewhere.

(a) Show that $Y_1 = \min(X_i)$ and $Y_n = \max(X_i)$, the joint sufficient statistics for
$\theta_1$ and $\theta_2$, are complete.

(b) Find the MVUEs of $\theta_1$ and $\theta_2$.

7.7.7. Let $X_1, X_2, \ldots, X_n$ be a random sample from $N(\theta_1, \theta_2)$.

(a) If the constant $b$ is defined by the equation $P(X \le b) = p$ where $p$ is specified,
find the mle and the MVUE of $b$.

(b) Modify the R function bootse1.R so that it returns the MVUE of Part (a)
and its bootstrap standard error.

(c) Run your function in Part (b) on the data set discussed in Example 7.6.4 for
$p = 0.75$ and 3,000 bootstraps.

7.7.8. In the notation of Example 7.7.3, show that the mle of $p_j p_l$ is $n^{-2}Y_j Y_l$.

7.7.9. Refer to Example 7.7.4 on sufficiency for the multivariate normal model.
(a) Determine the MVUE of the covariance parameters $\sigma_{ij}$.
(b) Let $g = \sum_{i=1}^k a_i\mu_i$, where $a_1, \ldots, a_k$ are specified constants. Find the MVUE
for $g$.


7.7.10. In a personal communication, LeRoy Folks noted that the inverse Gaussian
pdf
$$f(x;\theta_1,\theta_2) = \left(\frac{\theta_2}{2\pi x^3}\right)^{1/2}\exp\left[\frac{-\theta_2(x - \theta_1)^2}{2\theta_1^2 x}\right], \quad 0 < x < \infty, \qquad (7.7.9)$$

where $\theta_1 > 0$ and $\theta_2 > 0$, is often used to model lifetimes. Find the complete
sufficient statistics for $(\theta_1, \theta_2)$ if $X_1, X_2, \ldots, X_n$ is a random sample from the
distribution having this pdf.


7.7.11. Let $X_1, X_2, \ldots, X_n$ be a random sample from a $N(\theta_1,\theta_2)$ distribution.

(a) Show that $E[(X_1 - \theta_1)^4] = 3\theta_2^2$.

(b) Find the MVUE of $3\theta_2^2$.


7.7.12. Let $X_1, \ldots, X_n$ be a random sample from a distribution of the continuous
type with cdf $F(x)$. Suppose the mean, $\mu = E(X_1)$, exists. Using Example 7.7.5,
show that the sample mean, $\overline{X} = n^{-1}\sum_{i=1}^{n} X_i$, is the MVUE of $\mu$.


7.7.13. Let $X_1, \ldots, X_n$ be a random sample from a distribution of the continuous
type with cdf $F(x)$. Let $\theta = P(X_1 \le a) = F(a)$, where $a$ is known. Show that the
proportion $n^{-1}\#\{X_i \le a\}$ is the MVUE of $\theta$.


7.8 Minimal Sufficiency and Ancillary Statistics 


In the study of statistics, it is clear that we want to reduce the data contained in 
the entire sample as much as possible without losing relevant information about the 
important characteristics of the underlying distribution. That is, a large collection 
of numbers in the sample is not as meaningful as a few good summary statistics of 
those data. Sufficient statistics, if they exist, are valuable because we know that 
the statisticians with those summary measures have as much information as the 
statistician with the entire sample. Sometimes, however, there are several sets of 
joint sufficient statistics, and thus we would like to find the simplest one of these sets. 
For illustration, in a sense, the observations $X_1, X_2, \ldots, X_n$, $n > 2$, of a random
sample from $N(\theta_1,\theta_2)$ could be thought of as joint sufficient statistics for $\theta_1$ and $\theta_2$.
We know, however, that we can use $\overline{X}$ and $S^2$ as joint sufficient statistics for those
parameters, which is a great simplification over using $X_1, X_2, \ldots, X_n$, particularly
if $n$ is large.

In most instances in this chapter, we have been able to find a single sufficient
statistic for one parameter or two joint sufficient statistics for two parameters.
Possibly the most complicated cases considered so far are given in Example 7.7.3, in
which we find $k + k(k+1)/2$ joint sufficient statistics for $k + k(k+1)/2$ parameters;
or the multivariate normal distribution given in Example 7.7.4; or the use of the
order statistics of a random sample for some completely unknown distribution of
the continuous type as in Example 7.7.5.

What we would like to do is to change from one set of joint sufficient statistics 
to another, always reducing the number of statistics involved until we cannot go 
any further without losing the sufficiency of the resulting statistics. Those statistics 
that are there at the end of this reduction are called minimal sufficient statis- 
tics. These are sufficient for the parameters and are functions of every other set 
of sufficient statistics for those same parameters. Often, if there are k parameters, 
we can find k joint sufficient statistics that are minimal. In particular, if there is
one parameter, we can often find a single sufficient statistic that is minimal. Most 
of the earlier examples that we have considered illustrate this point, but this is not 
always the case, as shown by the following example. 


Example 7.8.1. Let $X_1, X_2, \ldots, X_n$ be a random sample from the uniform
distribution over the interval $(\theta - 1, \theta + 1)$ having pdf

$$f(x;\theta) = \left(\tfrac{1}{2}\right) I_{(\theta-1,\theta+1)}(x), \quad \text{where } -\infty < \theta < \infty.$$

The joint pdf of $X_1, X_2, \ldots, X_n$ equals the product of $(1/2)^n$ and certain indicator
functions, namely,

$$\left(\tfrac{1}{2}\right)^n \prod_{i=1}^{n} I_{(\theta-1,\theta+1)}(x_i) = \left(\tfrac{1}{2}\right)^n \{I_{(\theta-1,\theta+1)}[\min(x_i)]\}\{I_{(\theta-1,\theta+1)}[\max(x_i)]\},$$

because $\theta - 1 < \min(x_i) \le x_j \le \max(x_i) < \theta + 1$, $j = 1, 2, \ldots, n$. Thus the order
statistics $Y_1 = \min(X_i)$ and $Y_n = \max(X_i)$ are the sufficient statistics for $\theta$. These
two statistics actually are minimal for this one parameter, as we cannot reduce the
number of them to fewer than two and still have sufficiency.


There is an observation that helps us see that almost all the sufficient statistics
that we have studied thus far are minimal. We have noted that the mle $\hat{\theta}$ of $\theta$ is
a function of one or more sufficient statistics, when the latter exist. Suppose that
this mle $\hat{\theta}$ is also sufficient. Since this sufficient statistic $\hat{\theta}$ is a function of the other
sufficient statistics, by Theorem 7.3.2, it must be minimal. For example, we have


1. The mle $\hat{\theta} = \overline{X}$ of $\theta$ in $N(\theta,\sigma^2)$, $\sigma^2$ known, is a minimal sufficient statistic
for $\theta$.


2. The mle $\hat{\theta} = \overline{X}$ of $\theta$ in a Poisson distribution with mean $\theta$ is a minimal
sufficient statistic for $\theta$.


3. The mle $\hat{\theta} = Y_n = \max(X_i)$ of $\theta$ in the uniform distribution over $(0,\theta)$ is a
minimal sufficient statistic for $\theta$.


4. The maximum likelihood estimators $\hat{\theta}_1 = \overline{X}$ and $\hat{\theta}_2 = [(n-1)/n]S^2$ of $\theta_1$ and
$\theta_2$ in $N(\theta_1,\theta_2)$ are joint minimal sufficient statistics for $\theta_1$ and $\theta_2$.


From these examples we see that the minimal sufficient statistics do not need
to be unique, for any one-to-one transformation of them also provides minimal 
sufficient statistics. The linkage between minimal sufficient statistics and the mle, 
however, does not hold in many interesting instances. We illustrate this in the next 
two examples. 


Example 7.8.2. Consider the model given in Example 7.8.1. There we noted that
$Y_1 = \min(X_i)$ and $Y_n = \max(X_i)$ are joint sufficient statistics. Also, we have

$$\theta - 1 < Y_1 < Y_n < \theta + 1$$

or, equivalently,

$$Y_n - 1 < \theta < Y_1 + 1.$$

Hence, to maximize the likelihood function so that it equals $(1/2)^n$, $\theta$ can be any
value between $Y_n - 1$ and $Y_1 + 1$. For example, many statisticians take the mle to
be the mean of these two endpoints, namely,

$$\hat{\theta} = \frac{Y_n - 1 + Y_1 + 1}{2} = \frac{Y_1 + Y_n}{2},$$

which is the midrange. We recognize, however, that this mle is not unique. Some
might argue that since $\hat{\theta}$ is an mle of $\theta$ and since it is a function of the joint sufficient
statistics, $Y_1$ and $Y_n$, for $\theta$, it is a minimal sufficient statistic. This is not the case at
all, for $\hat{\theta}$ is not even sufficient. Note that the mle must itself be a sufficient statistic
for the parameter before it can be considered the minimal sufficient statistic. ■
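
The non-uniqueness of the mle in this model can be checked numerically. The following Python sketch is only an illustration under stated assumptions: the sample values and the helper name `likelihood` are hypothetical and do not come from the text. It evaluates the likelihood of the uniform $(\theta - 1, \theta + 1)$ model at several values of $\theta$ and shows that it takes the same maximal value $(1/2)^n$ everywhere on the interval $(Y_n - 1, Y_1 + 1)$, including at the midrange.

```python
def likelihood(theta, xs):
    """Likelihood of theta for a sample from uniform(theta - 1, theta + 1)."""
    inside = all(theta - 1 < x < theta + 1 for x in xs)
    return 0.5 ** len(xs) if inside else 0.0


# Hypothetical sample, chosen only for illustration.
xs = [4.2, 5.1, 4.8, 5.6, 4.5]
y1, yn = min(xs), max(xs)

lower, upper = yn - 1, y1 + 1      # every theta in (lower, upper) maximizes the likelihood
midrange = (y1 + yn) / 2           # one conventional choice of mle

for theta in (lower + 0.01, midrange, upper - 0.01, upper + 0.01):
    print(round(theta, 2), likelihood(theta, xs))
```

The first three values of $\theta$ lie inside the interval and give the same likelihood; the last lies just outside and gives zero.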


Note that we can model the situation in the last example by

$$X_i = \theta + W_i, \qquad (7.8.1)$$

where $W_1, W_2, \ldots, W_n$ are iid with the common uniform$(-1,1)$ pdf. Hence this is
an example of a location model. We discuss these models in general next.


Example 7.8.3. Consider a location model given by

$$X_i = \theta + W_i, \qquad (7.8.2)$$

where $W_1, W_2, \ldots, W_n$ are iid with the common pdf $f(w)$ and common continuous
cdf $F(w)$. From Example 7.7.5, we know that the order statistics $Y_1 < Y_2 < \cdots < Y_n$
are a set of complete and sufficient statistics for this situation. Can we obtain a
smaller set of minimal sufficient statistics? Consider the following four situations:


(a) Suppose $f(w)$ is the $N(0,1)$ pdf. Then we know that $\overline{X}$ is both the MVUE
and mle of $\theta$. Also, $\overline{X} = n^{-1}\sum_{i=1}^{n} Y_i$, i.e., a function of the order statistics.
Hence $\overline{X}$ is minimal sufficient.


(b) Suppose $f(w) = \exp\{-w\}$, for $w > 0$, zero elsewhere. Then the statistic $Y_1$ is
a sufficient statistic as well as the mle, and thus is minimal sufficient.


(c) Suppose $f(w)$ is the logistic pdf. As discussed in Example 6.1.2, the mle of $\theta$
exists and it is easy to compute. As shown on page 38 of Lehmann and Casella
(1998), though, the order statistics are minimal sufficient for this situation.
That is, no reduction is possible.


(d) Suppose $f(w)$ is the Laplace pdf. It was shown in Example 6.1.1 that the
median, $Q_2$, is the mle of $\theta$, but it is not a sufficient statistic. Further, similar
to the logistic pdf, it can be shown that the order statistics are minimal
sufficient for this situation. ■


In general, the situation described in parts (c) and (d), where the mle is obtained 
rather easily while the set of minimal sufficient statistics is the set of order statistics 
and no reduction is possible, is the norm for location models. 
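
As a small numerical companion to part (d), the Python sketch below checks on a grid that the sample median maximizes the Laplace location log-likelihood $-n\log 2 - \sum |x_i - \theta|$. It is a sketch under stated assumptions: the data values, the grid, and the function name `laplace_loglik` are hypothetical choices made for illustration, and the Laplace pdf is taken to be $f(w) = \tfrac{1}{2}e^{-|w|}$.

```python
import math
import statistics


def laplace_loglik(theta, xs):
    """Log-likelihood of theta when x_i = theta + w_i and w_i has pdf exp(-|w|)/2."""
    return -len(xs) * math.log(2) - sum(abs(x - theta) for x in xs)


# Hypothetical data with an odd sample size, so the median is unique.
xs = [1.3, -0.4, 2.2, 0.7, 5.9, 1.1, 0.2]
med = statistics.median(xs)

# Grid search over theta in [-2, 7] in steps of 0.01.
grid = [i / 100 for i in range(-200, 701)]
best = max(grid, key=lambda t: laplace_loglik(t, xs))

print("sample median:", med)
print("grid maximizer:", best)
print("log-likelihood at the sample mean:", laplace_loglik(statistics.mean(xs), xs))
```

Although the mle is this easy to locate, the median discards the remaining order statistics, which is why it fails to be sufficient here.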


There is also a relationship between a minimal sufficient statistic and completeness
that is explained more fully in Lehmann and Scheffé (1950). Let us say simply
and without explanation that for the cases in this book, complete sufficient statistics
are minimal sufficient statistics. The converse is not true, however, as can be seen by
noting that in Example 7.8.1, we have

$$E\left[\frac{Y_n - Y_1}{2} - \frac{n-1}{n+1}\right] = 0, \quad \text{for all } \theta.$$

That is, there is a nonzero function of those minimal sufficient statistics, $Y_1$ and $Y_n$,
whose expectation is zero for all $\theta$.
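
The identity $E[(Y_n - Y_1)/2] = (n-1)/(n+1)$ can be checked by simulation. The Python sketch below is a Monte Carlo illustration only; the seed, the sample size $n = 5$, the number of replications, and the function name `mean_half_range` are assumptions made here, not values from the text. It estimates the expectation for several values of $\theta$ and compares each estimate with $(n-1)/(n+1)$.

```python
import random


def mean_half_range(theta, n, reps, rng):
    """Monte Carlo estimate of E[(Yn - Y1)/2] for uniform(theta - 1, theta + 1)."""
    total = 0.0
    for _ in range(reps):
        xs = [rng.uniform(theta - 1, theta + 1) for _ in range(n)]
        total += (max(xs) - min(xs)) / 2
    return total / reps


rng = random.Random(2024)          # fixed seed so the run is reproducible
n = 5
target = (n - 1) / (n + 1)

estimates = {theta: mean_half_range(theta, n, 20000, rng)
             for theta in (-3.0, 0.0, 10.0)}

for theta, est in estimates.items():
    print(f"theta = {theta:5.1f}: estimate - target = {est - target:+.4f}")
```

Each difference should be close to zero regardless of $\theta$, which is what makes $(Y_n - Y_1)/2 - (n-1)/(n+1)$ a nonzero function with zero expectation.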

There are other statistics that almost seem opposites of sufficient statistics.
That is, while sufficient statistics contain all the information about the parameters,
these other statistics, called ancillary statistics, have distributions free of the
parameters and seemingly contain no information about those parameters. As an
illustration, we know that the variance $S^2$ of a random sample from $N(\theta,1)$ has
a distribution that does not depend upon $\theta$ and hence is an ancillary statistic.
Another example is the ratio $Z = X_1/(X_1 + X_2)$, where $X_1, X_2$ is a random sample
from a gamma distribution with known parameter $\alpha > 0$ and unknown parameter
$\beta = \theta$, because $Z$ has a beta distribution that is free of $\theta$. There are many examples
of ancillary statistics, and we provide some rules that make them rather easy to find
with certain models, which we present in the next three examples.


Example 7.8.4 (Location-Invariant Statistics). In Example 7.8.3, we introduced
the location model. Recall that a random sample $X_1, X_2, \ldots, X_n$ follows this model
if

$$X_i = \theta + W_i, \quad i = 1, \ldots, n, \qquad (7.8.3)$$

where $-\infty < \theta < \infty$ is a parameter and $W_1, W_2, \ldots, W_n$ are iid random variables
with the pdf $f(w)$, which does not depend on $\theta$. Then the common pdf of $X_i$ is
$f(x - \theta)$.


Let $Z = u(X_1, X_2, \ldots, X_n)$ be a statistic such that

$$u(x_1 + d, x_2 + d, \ldots, x_n + d) = u(x_1, x_2, \ldots, x_n),$$

for all real $d$. Hence

$$Z = u(W_1 + \theta, W_2 + \theta, \ldots, W_n + \theta) = u(W_1, W_2, \ldots, W_n)$$

is a function of $W_1, W_2, \ldots, W_n$ alone (not of $\theta$). Hence $Z$ must have a distribution
that does not depend upon $\theta$. We call $Z = u(X_1, X_2, \ldots, X_n)$ a location-invariant
statistic.

Assuming a location model, the following are some examples of location-invariant
statistics: the sample variance $= S^2$, the sample range $= \max\{X_i\} - \min\{X_i\}$, the
mean deviation from the sample median $= (1/n)\sum |X_i - \text{median}(X_i)|$, $X_1 + X_2 - X_3 - X_4$,
$X_1 + X_3 - 2X_2$, $(1/n)\sum [X_i - \min(X_i)]$, and so on. To see that the range
is location-invariant, note that

$$\max\{X_i\} - \theta = \max\{X_i - \theta\} = \max\{W_i\}$$
$$\min\{X_i\} - \theta = \min\{X_i - \theta\} = \min\{W_i\}.$$

So,

$$\text{range} = \max\{X_i\} - \min\{X_i\} = \max\{X_i\} - \theta - (\min\{X_i\} - \theta) = \max\{W_i\} - \min\{W_i\}.$$

Hence the distribution of the range only depends on the distribution of the $W_i$s
and, thus, it is location-invariant. For the location invariance of other statistics, see
Exercise 7.8.4.
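
The defining property $u(x_1 + d, \ldots, x_n + d) = u(x_1, \ldots, x_n)$ is easy to check numerically for a given statistic. The Python sketch below is an illustration only: the sample, the shift $d$, and the helper names `sample_range` and `mean_dev_from_median` are hypothetical. It evaluates several statistics on a sample and on its shifted copy; the sample mean is included as a contrast, since it is not location-invariant.

```python
import statistics


def sample_range(xs):
    return max(xs) - min(xs)


def mean_dev_from_median(xs):
    m = statistics.median(xs)
    return sum(abs(x - m) for x in xs) / len(xs)


# Hypothetical sample and shift; the values are exactly representable in binary.
xs = [2.0, 3.5, 1.25, 4.75, 3.0]
d = 10.0
shifted = [x + d for x in xs]

stats_to_check = [
    ("sample variance", statistics.variance),
    ("sample range", sample_range),
    ("mean deviation from median", mean_dev_from_median),
    ("sample mean (not invariant)", statistics.mean),
]

for name, stat in stats_to_check:
    print(f"{name}: original = {stat(xs)}, shifted = {stat(shifted)}")
```

A numerical check on one sample does not prove invariance, but a single mismatch, as with the sample mean, is enough to disprove it.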


Example 7.8.5 (Scale-Invariant Statistics). Consider a random sample $X_1, \ldots, X_n$
that follows a scale model, i.e., a model of the form

$$X_i = \theta W_i, \quad i = 1, \ldots, n, \qquad (7.8.4)$$

where $\theta > 0$ and $W_1, W_2, \ldots, W_n$ are iid random variables with pdf $f(w)$, which
does not depend on $\theta$. Then the common pdf of $X_i$ is $\theta^{-1} f(x/\theta)$. We call $\theta$ a scale
parameter. Suppose that $Z = u(X_1, X_2, \ldots, X_n)$ is a statistic such that

$$u(cx_1, cx_2, \ldots, cx_n) = u(x_1, x_2, \ldots, x_n)$$

for all $c > 0$. Then

$$Z = u(X_1, X_2, \ldots, X_n) = u(\theta W_1, \theta W_2, \ldots, \theta W_n) = u(W_1, W_2, \ldots, W_n).$$

Since neither the joint pdf of $W_1, W_2, \ldots, W_n$ nor $Z$ contains $\theta$, the distribution of
$Z$ must not depend upon $\theta$. We say that $Z$ is a scale-invariant statistic.

The following are some examples of scale-invariant statistics: $X_1/(X_1 + X_2)$,
$X_1^2/\sum_{i=1}^{n} X_i^2$, $\min(X_i)/\max(X_i)$, and so on. The scale invariance of the first statistic
follows from

$$\frac{X_1}{X_1 + X_2} = \frac{X_1/\theta}{(X_1 + X_2)/\theta} = \frac{W_1}{W_1 + W_2}.$$

The scale invariance of the other statistics is asked for in Exercise 7.8.5. ■


Example 7.8.6 (Location- and Scale-Invariant Statistics). Finally, consider a random
sample $X_1, X_2, \ldots, X_n$ that follows a location and scale model as in Example
7.7.5. That is,

$$X_i = \theta_1 + \theta_2 W_i, \quad i = 1, \ldots, n, \qquad (7.8.5)$$

where $W_i$ are iid with the common pdf $f(t)$ which is free of $\theta_1$ and $\theta_2$. In this case,
the pdf of $X_i$ is $\theta_2^{-1} f((x - \theta_1)/\theta_2)$. Consider the statistic $Z = u(X_1, X_2, \ldots, X_n)$
where

$$u(cx_1 + d, \ldots, cx_n + d) = u(x_1, \ldots, x_n).$$

Then

$$Z = u(X_1, \ldots, X_n) = u(\theta_1 + \theta_2 W_1, \ldots, \theta_1 + \theta_2 W_n) = u(W_1, \ldots, W_n).$$

Since neither the joint pdf of $W_1, \ldots, W_n$ nor $Z$ contains $\theta_1$ and $\theta_2$, the distribution
of $Z$ must not depend upon $\theta_1$ nor $\theta_2$. Statistics such as $Z = u(X_1, X_2, \ldots, X_n)$ are
called location- and scale-invariant statistics. The following are four examples
of such statistics:


(a) $T_1 = [\max(X_i) - \min(X_i)]/S$;

(b) $T_2 = \sum_{i=1}^{n-1} (X_{i+1} - X_i)^2/S^2$;

(c) $T_3 = (X_i - \overline{X})/S$;

(d) $T_4 = |X_i - X_j|/S$, $i \ne j$.


Let $\overline{X} - \theta_1 = n^{-1}\sum_{i=1}^{n} (X_i - \theta_1)$. Then the location and scale invariance of the
statistic in (d) follows from the two identities

$$(n-1)S^2 = \theta_2^2 \sum_{i=1}^{n} \left[\frac{X_i - \theta_1}{\theta_2} - \frac{\overline{X} - \theta_1}{\theta_2}\right]^2 = \theta_2^2 \sum_{i=1}^{n} (W_i - \overline{W})^2$$

$$X_i - X_j = \theta_2\left[\frac{X_i - \theta_1}{\theta_2} - \frac{X_j - \theta_1}{\theta_2}\right] = \theta_2 (W_i - W_j).$$

See Exercise 7.8.6 for the other statistics.
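
The combined property $u(cx_1 + d, \ldots, cx_n + d) = u(x_1, \ldots, x_n)$ can be checked numerically in the same spirit. The Python sketch below is a hedged illustration: the sample and the constants $c$ and $d$ are hypothetical, and `t1` and `t2` are simply local names for the statistics $T_1$ and $T_2$ listed in the example. The sample range alone is included as a contrast, since it is location-invariant but not scale-invariant.

```python
import statistics


def t1(xs):
    """T1 = (max - min) / S."""
    return (max(xs) - min(xs)) / statistics.stdev(xs)


def t2(xs):
    """T2 = sum of squared successive differences divided by S^2."""
    num = sum((b - a) ** 2 for a, b in zip(xs, xs[1:]))
    return num / statistics.variance(xs)


def sample_range(xs):
    return max(xs) - min(xs)


# Hypothetical sample and affine transformation x -> c*x + d with c > 0.
xs = [2.0, 3.5, 1.25, 4.75, 3.0]
c, d = 2.5, -7.0
ys = [c * x + d for x in xs]

print("T1:", t1(xs), t1(ys))
print("T2:", t2(xs), t2(ys))
print("range:", sample_range(xs), sample_range(ys))   # scales by c
```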


Thus, these location-invariant, scale-invariant, and location- and scale-invariant 
statistics provide good illustrations, with the appropriate model for the pdf, of an- 
cillary statistics. Since an ancillary statistic and a complete (minimal) sufficient 
statistic are such opposites, we might believe that there is, in some sense, no rela- 
tionship between the two. This is true, and in the next section we show that they 
are independent statistics. 


EXERCISES 


7.8.1. Let $X_1, X_2, \ldots, X_n$ be a random sample from each of the following
distributions involving the parameter $\theta$. In each case find the mle of $\theta$ and show that it is
a sufficient statistic for $\theta$ and hence a minimal sufficient statistic.

(a) $b(1,\theta)$, where $0 \le \theta \le 1$.

(b) Poisson with mean $\theta > 0$.

(c) Gamma with $\alpha = 3$ and $\beta = \theta > 0$.

(d) $N(\theta,1)$, where $-\infty < \theta < \infty$.

(e) $N(0,\theta)$, where $0 < \theta < \infty$.


7.8.2. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample of
size $n$ from the uniform distribution over the closed interval $[-\theta,\theta]$ having pdf
$f(x;\theta) = (1/2\theta) I_{[-\theta,\theta]}(x)$.

(a) Show that $Y_1$ and $Y_n$ are joint sufficient statistics for $\theta$.

(b) Argue that the mle of $\theta$ is $\hat{\theta} = \max(-Y_1, Y_n)$.

(c) Demonstrate that the mle $\hat{\theta}$ is a sufficient statistic for $\theta$ and thus is a minimal
sufficient statistic for $\theta$.


7.8.3. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample of size $n$
from a distribution with pdf

$$f(x;\theta_1,\theta_2) = \left(\frac{1}{\theta_2}\right) e^{-(x-\theta_1)/\theta_2} I_{(\theta_1,\infty)}(x),$$

where $-\infty < \theta_1 < \infty$ and $0 < \theta_2 < \infty$. Find the joint minimal sufficient statistics
for $\theta_1$ and $\theta_2$.


7.8.4. Continuing with Example 7.8.4, show that the following statistics are
location-invariant:

(a) The sample variance $= S^2$.

(b) The mean deviation from the sample median $= (1/n)\sum |X_i - \text{median}(X_i)|$.

(c) $(1/n)\sum [X_i - \min(X_i)]$.


7.8.5. In Example 7.8.5, a scale model was presented and scale invariance was 
defined. Using the notation of this example, show that the following statistics are 
scale-invariant: 


(a) $X_1^2/\sum_{i=1}^{n} X_i^2$.

(b) $\min\{X_i\}/\max\{X_i\}$.


7.8.6. Obtain the location and scale invariance of the other statistics listed in
Example 7.8.6, i.e., the statistics

(a) $T_1 = [\max(X_i) - \min(X_i)]/S$.

(b) $T_2 = \sum_{i=1}^{n-1} (X_{i+1} - X_i)^2/S^2$.


7.8.7. With random samples from each of the distributions given in Exercises
7.8.1(d), 7.8.2, and 7.8.3, define at least two ancillary statistics that are different
from the examples given in the text. These examples illustrate, respectively,
location-invariant, scale-invariant, and location- and scale-invariant statistics.


7.9 Sufficiency, Completeness, and Independence


We have noted that if we have a sufficient statistic $Y_1$ for a parameter $\theta$, $\theta \in \Omega$,
then $h(z|y_1)$, the conditional pdf of another statistic $Z$, given $Y_1 = y_1$, does not
depend upon $\theta$. If, moreover, $Y_1$ and $Z$ are independent, the pdf $g_2(z)$ of $Z$ is
such that $g_2(z) = h(z|y_1)$, and hence $g_2(z)$ must not depend upon $\theta$ either. So the
independence of a statistic $Z$ and the sufficient statistic $Y_1$ for a parameter $\theta$ implies
that the distribution of $Z$ does not depend upon $\theta \in \Omega$. That is, $Z$ is an ancillary
statistic.

It is interesting to investigate a converse of that property. Suppose that the
distribution of an ancillary statistic $Z$ does not depend upon $\theta$; then are $Z$ and
the sufficient statistic $Y_1$ for $\theta$ independent? To begin our search for the answer,
we know that the joint pdf of $Y_1$ and $Z$ is $g_1(y_1;\theta)h(z|y_1)$, where $g_1(y_1;\theta)$ and
$h(z|y_1)$ represent the marginal pdf of $Y_1$ and the conditional pdf of $Z$ given $Y_1 = y_1$,
respectively. Thus the marginal pdf of $Z$ is

$$\int_{-\infty}^{\infty} g_1(y_1;\theta)h(z|y_1)\,dy_1 = g_2(z),$$

which, by hypothesis, does not depend upon $\theta$. Because

$$\int_{-\infty}^{\infty} g_2(z)g_1(y_1;\theta)\,dy_1 = g_2(z),$$

it follows, by taking the difference of the last two integrals, that

$$\int_{-\infty}^{\infty} [g_2(z) - h(z|y_1)]g_1(y_1;\theta)\,dy_1 = 0 \qquad (7.9.1)$$

for all $\theta \in \Omega$. Since $Y_1$ is a sufficient statistic for $\theta$, $h(z|y_1)$ does not depend upon $\theta$.
By assumption, $g_2(z)$ and hence $g_2(z) - h(z|y_1)$ do not depend upon $\theta$. Now if the
family $\{g_1(y_1;\theta) : \theta \in \Omega\}$ is complete, Equation (7.9.1) would require that

$$g_2(z) - h(z|y_1) = 0 \quad \text{or} \quad g_2(z) = h(z|y_1).$$

That is, the joint pdf of $Y_1$ and $Z$ must be equal to

$$g_1(y_1;\theta)h(z|y_1) = g_1(y_1;\theta)g_2(z).$$


Accordingly, $Y_1$ and $Z$ are independent, and we have proved the following theorem,
which was considered in special cases by Neyman and Hogg and proved in general
by Basu.


Theorem 7.9.1. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution
having a pdf $f(x;\theta)$, $\theta \in \Omega$, where $\Omega$ is an interval set. Suppose that the statistic
$Y_1$ is a complete and sufficient statistic for $\theta$. Let $Z = u(X_1, X_2, \ldots, X_n)$ be any
other statistic (not a function of $Y_1$ alone). If the distribution of $Z$ does not depend
upon $\theta$, then $Z$ is independent of the sufficient statistic $Y_1$.


In the discussion above, it is interesting to observe that if $Y_1$ is a sufficient
statistic for $\theta$, then the independence of $Y_1$ and $Z$ implies that the distribution
of $Z$ does not depend upon $\theta$ whether $\{g_1(y_1;\theta) : \theta \in \Omega\}$ is or is not complete.
Conversely, to prove the independence from the fact that $g_2(z)$ does not depend
upon $\theta$, we definitely need the completeness. Accordingly, if we are dealing with
situations in which we know that the family $\{g_1(y_1;\theta) : \theta \in \Omega\}$ is complete (such as a
regular case of the exponential class), we can say that the statistic $Z$ is independent
of the sufficient statistic $Y_1$ if and only if the distribution of $Z$ does not depend
upon $\theta$ (i.e., $Z$ is an ancillary statistic).

It should be remarked that the theorem (including the special formulation of
it for regular cases of the exponential class) extends immediately to probability
density functions that involve $m$ parameters for which there exist $m$ joint sufficient
statistics. For example, let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution
having the pdf $f(x;\theta_1,\theta_2)$ that represents a regular case of the exponential class
so that there are two joint complete sufficient statistics for $\theta_1$ and $\theta_2$. Then any
other statistic $Z = u(X_1, X_2, \ldots, X_n)$ is independent of the joint complete sufficient
statistics if and only if the distribution of $Z$ does not depend upon $\theta_1$ or $\theta_2$.

We present an example of the theorem that provides an alternative proof of the
independence of $\overline{X}$ and $S^2$, the mean and the variance of a random sample of size $n$
from a distribution that is $N(\mu,\sigma^2)$. This proof is given as if we were unaware that
$(n-1)S^2/\sigma^2$ is $\chi^2(n-1)$, because that fact and the independence were established
in Theorem 3.6.1.


Example 7.9.1. Let $X_1, X_2, \ldots, X_n$ denote a random sample of size $n$ from a
distribution that is $N(\mu,\sigma^2)$. We know that the mean $\overline{X}$ of the sample is, for
every known $\sigma^2$, a complete sufficient statistic for the parameter $\mu$, $-\infty < \mu < \infty$.
Consider the statistic

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \overline{X})^2,$$

which is location-invariant. Thus $S^2$ must have a distribution that does not depend
upon $\mu$; and hence, by the theorem, $S^2$ and $\overline{X}$, the complete sufficient statistic for
$\mu$, are independent.
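
A simulation can make this independence plausible, although it cannot prove it. The Python sketch below is an illustration under stated assumptions: the seed, the sample size, the number of replications, the parameter values, and the helper names `pearson` and `mean_var_correlation` are choices made here, not taken from the text. It estimates the correlation between $\overline{X}$ and $S^2$ for normal samples, where the theorem gives independence, and for exponential samples, where the two statistics are dependent. A correlation near zero is only a necessary symptom of independence, not a demonstration of it.

```python
import random
import statistics


def pearson(us, vs):
    """Sample correlation coefficient of two equal-length lists."""
    mu, mv = statistics.fmean(us), statistics.fmean(vs)
    suv = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    suu = sum((u - mu) ** 2 for u in us)
    svv = sum((v - mv) ** 2 for v in vs)
    return suv / (suu * svv) ** 0.5


def mean_var_correlation(draw, n, reps):
    """Correlation between the sample mean and sample variance over reps samples."""
    means, variances = [], []
    for _ in range(reps):
        xs = [draw() for _ in range(n)]
        means.append(statistics.fmean(xs))
        variances.append(statistics.variance(xs))
    return pearson(means, variances)


rng = random.Random(7)             # fixed seed so the run is reproducible
r_normal = mean_var_correlation(lambda: rng.gauss(3.0, 2.0), 10, 5000)
r_expo = mean_var_correlation(lambda: rng.expovariate(1.0), 10, 5000)

print("normal samples:      corr(mean, variance) =", round(r_normal, 3))
print("exponential samples: corr(mean, variance) =", round(r_expo, 3))
```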


Example 7.9.2. Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from the
distribution having pdf

$$f(x;\theta) = \exp\{-(x - \theta)\}, \quad \theta < x < \infty, \quad -\infty < \theta < \infty,$$

and $f(x;\theta) = 0$ elsewhere.


Here the pdf is of the form $f(x - \theta)$, where $f(w) = e^{-w}$, $0 < w < \infty$, zero elsewhere.
Moreover, we know (Exercise 7.4.5) that the first order statistic $Y_1 = \min(X_i)$ is a
complete sufficient statistic for $\theta$. Hence $Y_1$ must be independent of each
location-invariant statistic $u(X_1, X_2, \ldots, X_n)$, enjoying the property that

$$u(x_1 + d, x_2 + d, \ldots, x_n + d) = u(x_1, x_2, \ldots, x_n)$$

for all real $d$. Illustrations of such statistics are $S^2$, the sample range, and

$$\frac{1}{n}\sum_{i=1}^{n} [X_i - \min(X_i)].$$


Example 7.9.3. Let $X_1, X_2$ denote a random sample of size $n = 2$ from a
distribution with pdf

$$f(x;\theta) = \frac{1}{\theta}e^{-x/\theta}, \quad 0 < x < \infty, \quad 0 < \theta < \infty,$$

and $f(x;\theta) = 0$ elsewhere.


The pdf is of the form $(1/\theta)f(x/\theta)$, where $f(w) = e^{-w}$, $0 < w < \infty$, zero elsewhere.
We know that $Y_1 = X_1 + X_2$ is a complete sufficient statistic for $\theta$. Hence,
$Y_1$ is independent of every scale-invariant statistic $u(X_1, X_2)$ with the property
$u(cx_1, cx_2) = u(x_1, x_2)$. Illustrations of these are $X_1/X_2$ and $X_1/(X_1 + X_2)$,
statistics that have F- and beta distributions, respectively. ■
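
The ancillarity of $X_1/(X_1 + X_2)$ in this model can be illustrated by simulation. In the Python sketch below, the seed, the values of $\theta$, the number of replications, and the function name `ratio_summary` are assumptions made for illustration. For two exponential observations the beta distribution of the ratio is the beta$(1,1)$, that is, the uniform$(0,1)$ distribution, so the simulated mean and variance should be near $1/2$ and $1/12$ whatever the value of $\theta$.

```python
import random
import statistics


def ratio_summary(theta, reps, rng):
    """Simulated mean and variance of X1/(X1 + X2) for exponential data with mean theta."""
    ratios = []
    for _ in range(reps):
        x1 = rng.expovariate(1 / theta)   # expovariate takes the rate, 1/mean
        x2 = rng.expovariate(1 / theta)
        ratios.append(x1 / (x1 + x2))
    m = statistics.fmean(ratios)
    v = statistics.fmean([(r - m) ** 2 for r in ratios])
    return m, v


rng = random.Random(99)            # fixed seed so the run is reproducible
summaries = {theta: ratio_summary(theta, 20000, rng) for theta in (0.5, 2.0, 50.0)}

for theta, (m, v) in summaries.items():
    print(f"theta = {theta:5.1f}: mean = {m:.3f}, variance = {v:.4f}")
```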


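Example 7.9.4. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution
that is $N(\theta_1,\theta_2)$, $-\infty < \theta_1 < \infty$, $0 < \theta_2 < \infty$. In Example 7.7.2 it was proved
that the mean $\overline{X}$ and the variance $S^2$ of the sample are joint complete sufficient
statistics for $\theta_1$ and $\theta_2$. Consider the statistic

$$Z = \frac{\sum_{i=1}^{n-1} (X_{i+1} - X_i)^2}{\sum_{i=1}^{n} (X_i - \overline{X})^2} = u(X_1, X_2, \ldots, X_n),$$

which satisfies the property that $u(cx_1 + d, \ldots, cx_n + d) = u(x_1, \ldots, x_n)$. That is,
the ancillary statistic $Z$ is independent of both $\overline{X}$ and $S^2$. ■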


In this section we have given several examples in which the complete sufficient 
statistics are independent of ancillary statistics. Thus, in those cases, the ancillary 
statistics provide no information about the parameters. However, if the sufficient 
statistics are not complete, the ancillary statistics could provide some information 
as the following example demonstrates. 


Example 7.9.5. We refer back to Examples 7.8.1 and 7.8.2. There the first and
nth order statistics, $Y_1$ and $Y_n$, were minimal sufficient statistics for $\theta$, where the
sample arose from an underlying distribution having pdf $(1/2) I_{(\theta-1,\theta+1)}(x)$. Often
$T_1 = (Y_1 + Y_n)/2$ is used as an estimator of $\theta$, as it is a function of those sufficient
statistics that is unbiased. Let us find a relationship between $T_1$ and the ancillary
statistic $T_2 = Y_n - Y_1$.

The joint pdf of $Y_1$ and $Y_n$ is

$$g(y_1, y_n; \theta) = n(n-1)(y_n - y_1)^{n-2}/2^n, \quad \theta - 1 < y_1 < y_n < \theta + 1,$$

zero elsewhere. Accordingly, the joint pdf of $T_1$ and $T_2$ is, since the absolute value
of the Jacobian equals 1,

$$h(t_1, t_2; \theta) = n(n-1)t_2^{n-2}/2^n, \quad \theta - 1 + \frac{t_2}{2} < t_1 < \theta + 1 - \frac{t_2}{2}, \quad 0 < t_2 < 2,$$

zero elsewhere. Thus the pdf of $T_2$ is

$$h_2(t_2; \theta) = n(n-1)t_2^{n-2}(2 - t_2)/2^n, \quad 0 < t_2 < 2,$$

zero elsewhere, which, of course, is free of $\theta$ as $T_2$ is an ancillary statistic. Thus,
the conditional pdf of $T_1$, given $T_2 = t_2$, is

$$h_{1|2}(t_1|t_2; \theta) = \frac{1}{2 - t_2}, \quad \theta - 1 + \frac{t_2}{2} < t_1 < \theta + 1 - \frac{t_2}{2}, \quad 0 < t_2 < 2,$$

zero elsewhere. Note that this is uniform on the interval $(\theta - 1 + t_2/2, \theta + 1 - t_2/2)$;
so the conditional mean and variance of $T_1$ are, respectively,

$$E(T_1|t_2) = \theta \quad \text{and} \quad \text{var}(T_1|t_2) = \frac{(2 - t_2)^2}{12}.$$

Given $T_2 = t_2$, we know something about the conditional variance of $T_1$. In particular,
if that observed value of $T_2$ is large (close to 2), then that variance is small and
we can place more reliance on the estimator $T_1$. On the other hand, a small value
of $t_2$ means that we have less confidence in $T_1$ as an estimator of $\theta$. It is extremely
interesting to note that this conditional variance does not depend upon the sample
size $n$ but only on the given value of $T_2 = t_2$. As the sample size increases, $T_2$ tends
to become larger and, in those cases, $T_1$ has smaller conditional variance. ■
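
The conditional variance formula $(2 - t_2)^2/12$ can be checked approximately by simulation. The Python sketch below is only an illustration: the seed, $\theta = 5$, $n = 4$, the number of replications, the two narrow windows used to approximate conditioning on $T_2 = t_2$, and the function name `conditional_t1` are all assumptions made here. It keeps the simulated values of $T_1$ whose range $T_2$ falls near $0.5$ or near $1.8$ and compares their spread.

```python
import random
import statistics


def conditional_t1(theta, n, t2_lo, t2_hi, reps, rng):
    """Simulated midranges T1 from samples whose range T2 lies in (t2_lo, t2_hi)."""
    kept = []
    for _ in range(reps):
        xs = [rng.uniform(theta - 1, theta + 1) for _ in range(n)]
        y1, yn = min(xs), max(xs)
        if t2_lo < yn - y1 < t2_hi:
            kept.append((y1 + yn) / 2)
    return kept


rng = random.Random(11)            # fixed seed so the run is reproducible
theta, n = 5.0, 4

small = conditional_t1(theta, n, 0.45, 0.55, 100000, rng)   # T2 near 0.5
large = conditional_t1(theta, n, 1.75, 1.85, 100000, rng)   # T2 near 1.8

var_small = statistics.pvariance(small)
var_large = statistics.pvariance(large)

print("T2 near 0.5: simulated var =", round(var_small, 4),
      " formula at 0.5 =", round((2 - 0.5) ** 2 / 12, 4))
print("T2 near 1.8: simulated var =", round(var_large, 4),
      " formula at 1.8 =", round((2 - 1.8) ** 2 / 12, 4))
```

Because each window has positive width, the simulated variances only approximate the formula evaluated at the window's center.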


While Example 7.9.5 is a special one demonstrating mathematically that an
ancillary statistic can provide some help in point estimation, this does actually
happen in practice, too. For illustration, we know that if the sample size is large
enough, then

$$T = \frac{\overline{X} - \mu}{S/\sqrt{n}}$$

has an approximate standard normal distribution. Of course, if the sample arises
from a normal distribution, $\overline{X}$ and $S$ are independent and $T$ has a t-distribution with
$n - 1$ degrees of freedom. Even if the sample arises from a symmetric distribution,
$\overline{X}$ and $S$ are uncorrelated and $T$ has an approximate t-distribution and certainly an
approximate standard normal distribution with sample sizes around 30 or 40. On
the other hand, if the sample arises from a highly skewed distribution (say to the
right), then $\overline{X}$ and $S$ are highly correlated and the probability $P(-1.96 < T < 1.96)$
is not necessarily close to 0.95 unless the sample size is extremely large (certainly
much greater than 30). Intuitively, one can understand why this correlation exists if
the underlying distribution is highly skewed to the right. While $S$ has a distribution
free of $\mu$ (and hence is an ancillary), a large value of $S$ implies a large value of $\overline{X}$,
since the underlying pdf is like the one depicted in Figure 7.9.1. Of course, a small
value of $\overline{X}$ (say less than the mode) requires a relatively small value of $S$. This
means that unless $n$ is extremely large, it is risky to say that

$$\left(\overline{x} - \frac{1.96s}{\sqrt{n}},\ \overline{x} + \frac{1.96s}{\sqrt{n}}\right)$$

provides an approximate 95% confidence interval with data from a very skewed
distribution. As a matter of fact, the authors have seen situations in which this
confidence coefficient is closer to 80%, rather than 95%, with sample sizes of 30 to
40.
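
The loss of coverage under skewness is easy to reproduce in a small simulation. The Python sketch below is a hedged illustration, not the authors' study: the lognormal distribution with shape parameter 1.5 is chosen here merely as a convenient right-skewed example (it is not the pdf of Figure 7.9.1), and the seed, sample size, number of replications, and function name `coverage` are likewise assumptions. It estimates how often the interval $\overline{x} \pm 1.96s/\sqrt{n}$ covers the true mean for normal data and for the skewed data.

```python
import math
import random
import statistics


def coverage(draw, true_mean, n, reps):
    """Fraction of simulated samples whose interval xbar +/- 1.96 s/sqrt(n) covers true_mean."""
    hits = 0
    for _ in range(reps):
        xs = [draw() for _ in range(n)]
        xbar = statistics.fmean(xs)
        half = 1.96 * statistics.stdev(xs) / math.sqrt(n)
        if xbar - half < true_mean < xbar + half:
            hits += 1
    return hits / reps


rng = random.Random(123)           # fixed seed so the run is reproducible
n, reps = 30, 4000

cov_normal = coverage(lambda: rng.gauss(0.0, 1.0), 0.0, n, reps)

sigma = 1.5                        # lognormal shape parameter (assumed for illustration)
lognormal_mean = math.exp(sigma ** 2 / 2)
cov_skewed = coverage(lambda: rng.lognormvariate(0.0, sigma), lognormal_mean, n, reps)

print("normal data:    estimated coverage =", cov_normal)
print("lognormal data: estimated coverage =", cov_skewed)
```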


Figure 7.9.1: Graph of a right skewed distribution; see also Exercise 7.9.14.


EXERCISES 


7.9.1. Let $Y_1 < Y_2 < Y_3 < Y_4$ denote the order statistics of a random sample
of size $n = 4$ from a distribution having pdf $f(x;\theta) = 1/\theta$, $0 < x < \theta$, zero
elsewhere, where $0 < \theta < \infty$. Argue that the complete sufficient statistic $Y_4$ for $\theta$
is independent of each of the statistics $Y_1/Y_4$ and $(Y_1 + Y_2)/(Y_3 + Y_4)$.

Hint: Show that the pdf is of the form $(1/\theta)f(x/\theta)$, where $f(w) = 1$, $0 < w < 1$,
zero elsewhere.


7.9.2. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample from a
$N(\theta,\sigma^2)$, $-\infty < \theta < \infty$, distribution. Show that the distribution of $Z = Y_n - \overline{X}$
does not depend upon $\theta$. Thus $\overline{Y} = \sum_{i=1}^{n} Y_i/n$, a complete sufficient statistic for $\theta$, is
independent of $Z$.


7.9.3. Let $X_1, X_2, \ldots, X_n$ be iid with the distribution $N(\theta,\sigma^2)$, $-\infty < \theta < \infty$.
Prove that a necessary and sufficient condition that the statistics $Z = \sum_{i=1}^{n} a_i X_i$ and
$Y = \sum_{i=1}^{n} X_i$, a complete sufficient statistic for $\theta$, are independent is that $\sum_{i=1}^{n} a_i = 0$.


7.9.4. Let $X$ and $Y$ be random variables such that $E(X^k)$ and $E(Y^k) \ne 0$ exist
for $k = 1, 2, 3, \ldots$. If the ratio $X/Y$ and its denominator $Y$ are independent, prove
that $E[(X/Y)^k] = E(X^k)/E(Y^k)$, $k = 1, 2, 3, \ldots$.

Hint: Write $E(X^k) = E[Y^k (X/Y)^k]$.


7.9.5. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample of size $n$
from a distribution that has pdf $f(x;\theta) = (1/\theta)e^{-x/\theta}$, $0 < x < \infty$, $0 < \theta < \infty$, zero
elsewhere. Show that the ratio $R = nY_1/\sum_{i=1}^{n} Y_i$ and its denominator (a complete
sufficient statistic for $\theta$) are independent. Use the result of the preceding exercise
to determine $E(R^k)$, $k = 1, 2, 3, \ldots$.


7.9.6. Let $X_1, X_2, \ldots, X_5$ be iid with pdf $f(x) = e^{-x}$, $0 < x < \infty$, zero elsewhere.
Show that $(X_1 + X_2)/(X_1 + X_2 + \cdots + X_5)$ and its denominator are independent.
Hint: The pdf $f(x)$ is a member of $\{f(x;\theta) : 0 < \theta < \infty\}$, where $f(x;\theta) =
(1/\theta)e^{-x/\theta}$, $0 < x < \infty$, zero elsewhere.


7.9.7. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample from the
normal distribution $N(\theta_1,\theta_2)$, $-\infty < \theta_1 < \infty$, $0 < \theta_2 < \infty$. Show that the joint
complete sufficient statistics $\overline{X} = \overline{Y}$ and $S^2$ for $\theta_1$ and $\theta_2$ are independent of each
of $(Y_n - \overline{Y})/S$ and $(Y_n - Y_1)/S$.


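7.9.8. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample from a
distribution with the pdf

$$f(x;\theta_1,\theta_2) = \frac{1}{\theta_2}\exp\left(-\frac{x - \theta_1}{\theta_2}\right),$$

$\theta_1 < x < \infty$, zero elsewhere, where $-\infty < \theta_1 < \infty$, $0 < \theta_2 < \infty$. Show that the
joint complete sufficient statistics $Y_1$ and $\overline{X} = \overline{Y}$ for the parameters $\theta_1$ and $\theta_2$ are
independent of $(Y_2 - Y_1)/\sum_{i=1}^{n} (Y_i - Y_1)$.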


7.9.9. Let $X_1, X_2, \ldots, X_5$ be a random sample of size $n = 5$ from the normal
distribution $N(0,\theta)$.


(a) Argue that the ratio $R = (X_1^2 + X_2^2)/(X_1^2 + \cdots + X_5^2)$ and its denominator
$(X_1^2 + \cdots + X_5^2)$ are independent.


(b) Does $5R/2$ have an F-distribution with 2 and 5 degrees of freedom? Explain
your answer.


(c) Compute $E(R)$ using Exercise 7.9.4.


7.9.10. Referring to Example 7.9.5 of this section, determine $c$ so that

$$P(-c < T_1 - \theta < c \mid T_2 = t_2) = 0.95.$$

Use this result to find a 95% confidence interval for $\theta$, given $T_2 = t_2$; and note how
its length is smaller when the range $t_2$ is larger.


7.9.11. Show that $Y = |X|$ is a complete sufficient statistic for $\theta > 0$, where $X$ has
the pdf $f_X(x;\theta) = 1/(2\theta)$, for $-\theta < x < \theta$, zero elsewhere. Show that $Y = |X|$ and
$Z = \text{sgn}(X)$ are independent.


7.9.12. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample from a
$N(\theta,\sigma^2)$ distribution, where $\sigma^2$ is fixed but arbitrary. Then $\overline{Y} = \overline{X}$ is a complete
sufficient statistic for $\theta$. Consider another estimator $T$ of $\theta$, such as $T = (Y_i +
Y_{n+1-i})/2$, for $i = 1, 2, \ldots, [n/2]$, or $T$ could be any weighted average of these latter
statistics.


(a) Argue that $T - \overline{X}$ and $\overline{X}$ are independent random variables.

(b) Show that $\text{Var}(T) = \text{Var}(\overline{X}) + \text{Var}(T - \overline{X})$.


(c) Since we know $\text{Var}(\overline{X}) = \sigma^2/n$, it might be more efficient to estimate $\text{Var}(T)$
by estimating the $\text{Var}(T - \overline{X})$ by Monte Carlo methods rather than doing that
with $\text{Var}(T)$ directly, because $\text{Var}(T) \ge \text{Var}(T - \overline{X})$. This is often called the
Monte Carlo Swindle.


7.9.13. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from a distribution with pdf
$f(x;\theta) = (1/2)\theta^3 x^2 e^{-\theta x}$, $0 < x < \infty$, zero elsewhere, where $0 < \theta < \infty$:


(a) Find the mle, $\hat{\theta}$, of $\theta$. Is $\hat{\theta}$ unbiased?
Hint: Find the pdf of $Y = \sum_{i=1}^{n} X_i$ and then compute $E(\hat{\theta})$.


(b) Argue that $Y$ is a complete sufficient statistic for $\theta$.

(c) Find the MVUE of $\theta$.

(d) Show that $X_1/Y$ and $Y$ are independent.

(e) What is the distribution of $X_1/Y$?


7.9.14. The pdf depicted in Figure 7.9.1 is given by

$$f_{m_2}(x) = e^{-x}\left(1 + m_2^{-1}e^{-x}\right)^{-(m_2+1)}, \quad -\infty < x < \infty, \qquad (7.9.2)$$


where $m_2 > 0$ (the pdf graphed is for $m_2 = 0.1$). This is a member of a large family
of pdfs, the log F-family, which are useful in survival (lifetime) analysis; see Chapter 3
of Hettmansperger and McKean (2011).


(a) Let $W$ be a random variable with pdf (7.9.2). Show that $W = \log Y$, where
$Y$ has an F-distribution with 2 and $2m_2$ degrees of freedom.


(b) Show that the pdf becomes the logistic (6.1.8) if $m_2 = 1$.

(c) Consider the location model where

$$X_i = \theta + W_i, \quad i = 1, \ldots, n,$$

where $W_1, \ldots, W_n$ are iid with pdf (7.9.2). Similar to the logistic location
model, the order statistics are minimal sufficient for this model. Show, similar
to Example 6.1.2, that the mle of $\theta$ exists.


Chapter 8


Optimal Tests of Hypotheses


8.1 Most Powerful Tests 


In Section 4.5, we introduced the concept of hypotheses testing and followed it with 
the introduction of likelihood ratio tests in Chapter 6. In this chapter, we discuss 
certain best tests. 

For the convenience of the reader, in the next several paragraphs we quickly review
concepts of testing that were presented in Section 4.5. We are interested in a random
variable $X$ that has pdf or pmf $f(x;\theta)$, where $\theta \in \Omega$. We assume that $\theta \in \omega_0$ or
$\theta \in \omega_1$, where $\omega_0$ and $\omega_1$ are disjoint subsets of $\Omega$ and $\omega_0 \cup \omega_1 = \Omega$. We label the
hypotheses as

$$H_0: \theta \in \omega_0 \ \text{ versus } \ H_1: \theta \in \omega_1. \qquad (8.1.1)$$


The hypothesis $H_0$ is referred to as the null hypothesis, while $H_1$ is referred to
as the alternative hypothesis. The test of $H_0$ versus $H_1$ is based on a sample
$X_1, \ldots, X_n$ from the distribution of $X$. In this chapter, we often use the vector
$\mathbf{X}' = (X_1, \ldots, X_n)$ to denote the random sample and $\mathbf{x}' = (x_1, \ldots, x_n)$ to denote
the values of the sample. Let $\mathcal{S}$ denote the support of the random sample $\mathbf{X}' =
(X_1, \ldots, X_n)$.


A test of Hp versus Hy is based on a subset C of S. This set C is called the 
critical region and its corresponding decision rule is 


Reject Hp (Accept H,) ifXec (8.1.2) 
Retain Ho (Reject H;) if Xe C°. 


Note that a test is defined by its critical region. Conversely, a critical region defines 
a test. 

Recall that the 2 x 2 decision table, Table 4.5.1, summarizes the results of the
hypothesis test in terms of the true state of nature. Besides the correct decisions,
two errors can occur. A Type I error occurs if $H_0$ is rejected when it is true, while
a Type II error occurs if $H_0$ is accepted when $H_1$ is true. The size or significance
level of the test is the probability of a Type I error; i.e.,

$$\alpha = \max_{\theta \in \omega_0} P_\theta(\mathbf{X} \in C). \tag{8.1.3}$$

Note that $P_\theta(\mathbf{X} \in C)$ should be read as the probability that $\mathbf{X} \in C$ when $\theta$ is the
true parameter. Subject to tests having size $\alpha$, we select tests that minimize Type II
error or equivalently maximize the probability of rejecting $H_0$ when $\theta \in \omega_1$. Recall
that the power function of a test is given by

$$\gamma_C(\theta) = P_\theta(\mathbf{X} \in C); \quad \theta \in \omega_1. \tag{8.1.4}$$


In Chapter 4, we gave examples of tests of hypotheses, while in Sections 6.3 and 
6.4, we discussed tests based on maximum likelihood theory. In this chapter, we 
want to construct best tests for certain situations. 

We begin with testing a simple hypothesis $H_0$ against a simple alternative $H_1$.
Let $f(x;\theta)$ denote the pdf or pmf of a random variable $X$, where $\theta \in \Omega = \{\theta', \theta''\}$.
Let $\omega_0 = \{\theta'\}$ and $\omega_1 = \{\theta''\}$. Let $\mathbf{X}' = (X_1, \ldots, X_n)$ be a random sample from
the distribution of $X$. We now define a best critical region (and hence a best test)
for testing the simple hypothesis $H_0$ against the alternative simple hypothesis $H_1$.


Definition 8.1.1. Let $C$ denote a subset of the sample space. Then we say that $C$
is a best critical region of size $\alpha$ for testing the simple hypothesis $H_0: \theta = \theta'$
against the alternative simple hypothesis $H_1: \theta = \theta''$ if

(a) $P_{\theta'}[\mathbf{X} \in C] = \alpha$.
(b) And for every subset $A$ of the sample space,

$$P_{\theta'}[\mathbf{X} \in A] = \alpha \ \Rightarrow \ P_{\theta''}[\mathbf{X} \in C] \ge P_{\theta''}[\mathbf{X} \in A].$$


This definition states, in effect, the following: In general, there is a multiplicity
of subsets $A$ of the sample space such that $P_{\theta'}[\mathbf{X} \in A] = \alpha$. Suppose that there
is one of these subsets, say $C$, such that when $H_1$ is true, the power of the test
associated with $C$ is at least as great as the power of the test associated with every
other $A$. Then $C$ is defined as a best critical region of size $\alpha$ for testing $H_0$ against
$H_1$.

As Theorem 8.1.1 shows, there is a best test for this simple versus simple case.
But first, we offer a simple example examining this definition in some detail.


Example 8.1.1. Consider the one random variable $X$ that has a binomial distribution
with $n = 5$ and $p = \theta$. Let $f(x;\theta)$ denote the pmf of $X$ and let $H_0: \theta = \frac{1}{2}$
and $H_1: \theta = \frac{3}{4}$. The following tabulation gives, at points of positive probability
density, the values of $f(x;\frac{1}{2})$, $f(x;\frac{3}{4})$, and the ratio $f(x;\frac{1}{2})/f(x;\frac{3}{4})$.

x                   | 0        | 1        | 2
f(x;1/2)            | 1/32     | 5/32     | 10/32
f(x;3/4)            | 1/1024   | 15/1024  | 90/1024
f(x;1/2)/f(x;3/4)   | 32/1     | 32/3     | 32/9

x                   | 3        | 4        | 5
f(x;1/2)            | 10/32    | 5/32     | 1/32
f(x;3/4)            | 270/1024 | 405/1024 | 243/1024
f(x;1/2)/f(x;3/4)   | 32/27    | 32/81    | 32/243


We shall use one random value of $X$ to test the simple hypothesis $H_0: \theta = \frac{1}{2}$
against the alternative simple hypothesis $H_1: \theta = \frac{3}{4}$, and we shall first assign
the significance level of the test to be $\alpha = \frac{1}{32}$. We seek a best critical region of
size $\alpha = \frac{1}{32}$. If $A_1 = \{x : x = 0\}$ or $A_2 = \{x : x = 5\}$, then $P_{\{\theta=1/2\}}(X \in
A_1) = P_{\{\theta=1/2\}}(X \in A_2) = \frac{1}{32}$ and there is no other subset $A_3$ of the space $\{x :
x = 0, 1, 2, 3, 4, 5\}$ such that $P_{\{\theta=1/2\}}(X \in A_3) = \frac{1}{32}$. Then either $A_1$ or $A_2$ is
the best critical region $C$ of size $\alpha = \frac{1}{32}$ for testing $H_0$ against $H_1$. We note that
$P_{\{\theta=1/2\}}(X \in A_1) = \frac{1}{32}$ and $P_{\{\theta=3/4\}}(X \in A_1) = \frac{1}{1024}$. Thus, if the set $A_1$ is used as
a critical region of size $\alpha = \frac{1}{32}$, we have the intolerable situation that the probability
of rejecting $H_0$ when $H_1$ is true ($H_0$ is false) is much less than the probability of
rejecting $H_0$ when $H_0$ is true.

On the other hand, if the set $A_2$ is used as a critical region, then $P_{\{\theta=1/2\}}(X \in
A_2) = \frac{1}{32}$ and $P_{\{\theta=3/4\}}(X \in A_2) = \frac{243}{1024}$. That is, the probability of rejecting $H_0$
when $H_1$ is true is much greater than the probability of rejecting $H_0$ when $H_0$ is
true. Certainly, this is a more desirable state of affairs, and actually $A_2$ is the best
critical region of size $\alpha = \frac{1}{32}$. The latter statement follows from the fact that when
$H_0$ is true, there are but two subsets, $A_1$ and $A_2$, of the sample space, each of whose
probability measure is $\frac{1}{32}$, and the fact that

$$\tfrac{243}{1024} = P_{\{\theta=3/4\}}(X \in A_2) > P_{\{\theta=3/4\}}(X \in A_1) = \tfrac{1}{1024}.$$


It should be noted in this problem that the best critical region $C = A_2$ of size
$\alpha = \frac{1}{32}$ is found by including in $C$ the point (or points) at which $f(x;\frac{1}{2})$ is small in
comparison with $f(x;\frac{3}{4})$. This is seen to be true once it is observed that the ratio
$f(x;\frac{1}{2})/f(x;\frac{3}{4})$ is a minimum at $x = 5$. Accordingly, the ratio $f(x;\frac{1}{2})/f(x;\frac{3}{4})$, which
is given in the last line of the above tabulation, provides us with a precise tool by
which to find a best critical region $C$ for certain given values of $\alpha$. To illustrate this,
take $\alpha = \frac{6}{32}$. When $H_0$ is true, each of the subsets $\{x : x = 0, 1\}$, $\{x : x = 0, 4\}$,
$\{x : x = 1, 5\}$, $\{x : x = 4, 5\}$ has probability measure $\frac{6}{32}$. By direct computation it
is found that the best critical region of this size is $\{x : x = 4, 5\}$. This reflects the
fact that the ratio $f(x;\frac{1}{2})/f(x;\frac{3}{4})$ has its two smallest values for $x = 4$ and $x = 5$.

The power of this test, which has $\alpha = \frac{6}{32}$, is

$$P_{\{\theta=3/4\}}(X = 4, 5) = \tfrac{405}{1024} + \tfrac{243}{1024} = \tfrac{648}{1024}.$$
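
The direct computation in Example 8.1.1 can be checked by brute force. The following is a minimal sketch in Python (used here in place of R, with exact fractions); the helper names `binom_pmf` and `best_region` are illustrative and are not part of the text or of hmcpkg. It enumerates every subset of the support whose probability under $H_0$ equals the requested size and keeps the one with the largest probability under $H_1$.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def binom_pmf(x, n, p):
    """Exact binomial pmf; p is expected to be a Fraction."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n = 5
theta0, theta1 = Fraction(1, 2), Fraction(3, 4)
support = range(n + 1)
f0 = {x: binom_pmf(x, n, theta0) for x in support}   # pmf under H0
f1 = {x: binom_pmf(x, n, theta1) for x in support}   # pmf under H1
ratio = {x: f0[x] / f1[x] for x in support}          # last line of the tabulation

def best_region(alpha):
    """Among subsets with null probability exactly alpha, return (power, subset) with largest power."""
    candidates = []
    for r in range(1, n + 2):
        for subset in combinations(support, r):
            if sum(f0[x] for x in subset) == alpha:
                candidates.append((sum(f1[x] for x in subset), subset))
    return max(candidates)

power, region = best_region(Fraction(6, 32))
print(region, power)
```

Calling `best_region(Fraction(1, 32))` repeats the comparison of $A_1$ and $A_2$ made earlier in the example. The exhaustive search is only feasible because the support has six points; the theorem that follows removes the need for it.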


The preceding example should make the following theorem, due to Neyman and
Pearson, easier to understand. It is an important theorem because it provides a
systematic method of determining a best critical region.

Theorem 8.1.1. Neyman-Pearson Theorem. Let $X_1, X_2, \ldots, X_n$, where $n$
is a fixed positive integer, denote a random sample from a distribution that has pdf
or pmf $f(x;\theta)$. Then the likelihood of $X_1, X_2, \ldots, X_n$ is

$$L(\theta; \mathbf{x}) = \prod_{i=1}^{n} f(x_i;\theta), \quad \text{for } \mathbf{x}' = (x_1, \ldots, x_n).$$

Let $\theta'$ and $\theta''$ be distinct fixed values of $\theta$ so that $\Omega = \{\theta : \theta = \theta', \theta''\}$, and let $k$ be
a positive number. Let $C$ be a subset of the sample space such that

(a) $\dfrac{L(\theta'; \mathbf{x})}{L(\theta''; \mathbf{x})} \le k$, for each point $\mathbf{x} \in C$.
(b) $\dfrac{L(\theta'; \mathbf{x})}{L(\theta''; \mathbf{x})} \ge k$, for each point $\mathbf{x} \in C^c$.
(c) $\alpha = P_{H_0}[\mathbf{X} \in C]$.

Then $C$ is a best critical region of size $\alpha$ for testing the simple hypothesis $H_0: \theta = \theta'$
against the alternative simple hypothesis $H_1: \theta = \theta''$.


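Proof: We shall give the proof when the random variables are of the continuous
type. If $C$ is the only critical region of size $\alpha$, the theorem is proved. If there
is another critical region of size $\alpha$, denote it by $A$. For convenience, we shall let
$\int \cdots \int_R L(\theta; x_1, \ldots, x_n)\, dx_1 \cdots dx_n$ be denoted by $\int_R L(\theta)$. In this notation we wish
to show that

$$\int_C L(\theta'') - \int_A L(\theta'') \ge 0.$$

Since $C$ is the union of the disjoint sets $C \cap A$ and $C \cap A^c$ and $A$ is the union of the
disjoint sets $A \cap C$ and $A \cap C^c$, we have

$$\int_C L(\theta'') - \int_A L(\theta'') = \int_{C \cap A} L(\theta'') + \int_{C \cap A^c} L(\theta'') - \int_{A \cap C} L(\theta'') - \int_{A \cap C^c} L(\theta'')$$
$$= \int_{C \cap A^c} L(\theta'') - \int_{A \cap C^c} L(\theta''). \tag{8.1.5}$$

However, by the hypothesis of the theorem, $L(\theta'') \ge (1/k)L(\theta')$ at each point of $C$,
and hence at each point of $C \cap A^c$; thus,

$$\int_{C \cap A^c} L(\theta'') \ge \frac{1}{k} \int_{C \cap A^c} L(\theta').$$

But $L(\theta'') \le (1/k)L(\theta')$ at each point of $C^c$, and hence at each point of $A \cap C^c$;
accordingly,

$$\int_{A \cap C^c} L(\theta'') \le \frac{1}{k} \int_{A \cap C^c} L(\theta').$$

These inequalities imply that

$$\int_{C \cap A^c} L(\theta'') - \int_{A \cap C^c} L(\theta'') \ge \frac{1}{k} \int_{C \cap A^c} L(\theta') - \frac{1}{k} \int_{A \cap C^c} L(\theta');$$

and, from Equation (8.1.5), we obtain

$$\int_C L(\theta'') - \int_A L(\theta'') \ge \frac{1}{k} \left[ \int_{C \cap A^c} L(\theta') - \int_{A \cap C^c} L(\theta') \right]. \tag{8.1.6}$$

However,

$$\int_{C \cap A^c} L(\theta') - \int_{A \cap C^c} L(\theta') = \int_{C \cap A^c} L(\theta') + \int_{C \cap A} L(\theta') - \int_{A \cap C} L(\theta') - \int_{A \cap C^c} L(\theta')$$
$$= \int_C L(\theta') - \int_A L(\theta') = \alpha - \alpha = 0.$$

If this result is substituted in inequality (8.1.6), we obtain the desired result,

$$\int_C L(\theta'') - \int_A L(\theta'') \ge 0.$$

If the random variables are of the discrete type, the proof is the same with integration
replaced by summation.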


Remark 8.1.1. As stated in the theorem, conditions (a), (b), and (c) are sufficient
ones for region $C$ to be a best critical region of size $\alpha$. However, they are also
necessary. We discuss this briefly. Suppose there is a region $A$ of size $\alpha$ that does
not satisfy (a) and (b) and that is as powerful at $\theta = \theta''$ as $C$, which satisfies (a),
(b), and (c). Then expression (8.1.5) would be zero, since the power at $\theta''$ using $A$ is
equal to that using $C$. It can be proved that to have expression (8.1.5) equal zero, $A$
must be of the same form as $C$. As a matter of fact, in the continuous case, $A$ and $C$
would essentially be the same region; that is, they could differ only by a set having
probability zero. However, in the discrete case, if $P_{H_0}[L(\theta') = kL(\theta'')]$ is positive,
$A$ and $C$ could be different sets, but each would necessarily enjoy conditions (a),
(b), and (c) to be a best critical region of size $\alpha$.


It would seem that a test should have the property that its power should never
fall below its significance level; otherwise, the probability of falsely rejecting $H_0$
(level) is higher than the probability of correctly rejecting $H_0$ (power). We say a
test having this property is unbiased, which we now formally define:


Definition 8.1.2. Let $X$ be a random variable which has pdf or pmf $f(x;\theta)$, where
$\theta \in \Omega$. Consider the hypotheses given in expression (8.1.1). Let $\mathbf{X}' = (X_1, \ldots, X_n)$
denote a random sample on $X$. Consider a test with critical region $C$ and level $\alpha$.
We say that this test is unbiased if

$$P_\theta(\mathbf{X} \in C) \ge \alpha,$$

for all $\theta \in \omega_1$.

As the next corollary shows, the best test given in Theorem 8.1.1 is an unbiased
test.


Corollary 8.1.1. As in Theorem 8.1.1, let $C$ be the critical region of the best test
of $H_0: \theta = \theta'$ versus $H_1: \theta = \theta''$. Suppose the significance level of the test is $\alpha$.
Let $\gamma_C(\theta'') = P_{\theta''}[\mathbf{X} \in C]$ denote the power of the test. Then $\alpha \le \gamma_C(\theta'')$.


Proof: Consider the "unreasonable" test in which the data are ignored, but a
Bernoulli trial is performed which has probability $\alpha$ of success. If the trial ends
in success, we reject $H_0$. The level of this test is $\alpha$. Because the power of a test
is the probability of rejecting $H_0$ when $H_1$ is true, the power of this unreasonable
test is $\alpha$ also. But $C$ is the best critical region of size $\alpha$ and thus has power greater
than or equal to the power of the unreasonable test. That is, $\gamma_C(\theta'') \ge \alpha$, which is
the desired result.


Another aspect of Theorem 8.1.1 to be emphasized is that if we take $C$ to be
the set of all points $\mathbf{x}$ which satisfy

$$\frac{L(\theta'; \mathbf{x})}{L(\theta''; \mathbf{x})} \le k, \quad k > 0,$$

then, in accordance with the theorem, $C$ is a best critical region. This inequality
can frequently be expressed in one of the forms (where $c_1$ and $c_2$ are constants)

$$u_1(\mathbf{x}; \theta', \theta'') \le c_1$$
or
$$u_2(\mathbf{x}; \theta', \theta'') \ge c_2.$$

Suppose that it is the first form, $u_1 \le c_1$. Since $\theta'$ and $\theta''$ are given constants,
$u_1(\mathbf{X}; \theta', \theta'')$ is a statistic; and if the pdf or pmf of this statistic can be found when
$H_0$ is true, then the significance level of the test of $H_0$ against $H_1$ can be determined
from this distribution. That is,

$$\alpha = P_{H_0}[u_1(\mathbf{X}; \theta', \theta'') \le c_1].$$

Moreover, the test may be based on this statistic; for if the observed vector value
of $\mathbf{X}$ is $\mathbf{x}$, we reject $H_0$ (accept $H_1$) if $u_1(\mathbf{x}) \le c_1$.

A positive number $k$ determines a best critical region $C$ whose size is $\alpha =
P_{H_0}[\mathbf{X} \in C]$ for that particular $k$. It may be that this value of $\alpha$ is unsuitable for
the purpose at hand; that is, it is too large or too small. However, if there is a
statistic $u_1(\mathbf{X})$ as in the preceding paragraph, whose pdf or pmf can be determined
when $H_0$ is true, we need not experiment with various values of $k$ to obtain a
desirable significance level. For if the distribution of the statistic is known, or can
be found, we may determine $c_1$ such that $P_{H_0}[u_1(\mathbf{X}) \le c_1]$ is a desirable significance
level.

An illustrative example follows.

Example 8.1.2. Let $\mathbf{X}' = (X_1, \ldots, X_n)$ denote a random sample from the distribution
that has the pdf

$$f(x;\theta) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(x - \theta)^2}{2} \right), \quad -\infty < x < \infty.$$

It is desired to test the simple hypothesis $H_0: \theta = \theta' = 0$ against the alternative
simple hypothesis $H_1: \theta = \theta'' = 1$. Now

$$\frac{L(\theta'; \mathbf{x})}{L(\theta''; \mathbf{x})} = \frac{(1/\sqrt{2\pi})^n \exp\left[ -\sum_{1}^{n} x_i^2 / 2 \right]}{(1/\sqrt{2\pi})^n \exp\left[ -\sum_{1}^{n} (x_i - 1)^2 / 2 \right]} = \exp\left( -\sum_{1}^{n} x_i + \frac{n}{2} \right).$$

If $k > 0$, the set of all points $(x_1, x_2, \ldots, x_n)$ such that

$$\exp\left( -\sum_{1}^{n} x_i + \frac{n}{2} \right) \le k$$

is a best critical region. This inequality holds if and only if

$$-\sum_{1}^{n} x_i + \frac{n}{2} \le \log k$$

or, equivalently,

$$\sum_{1}^{n} x_i \ge \frac{n}{2} - \log k = c.$$

In this case, a best critical region is the set $C = \{(x_1, x_2, \ldots, x_n) : \sum_{1}^{n} x_i \ge c\}$,
where $c$ is a constant that can be determined so that the size of the critical region
is a desired number $\alpha$. The event $\sum_{1}^{n} X_i \ge c$ is equivalent to the event $\bar{X} \ge
c/n = c_1$, for example, so the test may be based upon the statistic $\bar{X}$. If $H_0$ is
true, that is, $\theta = \theta' = 0$, then $\bar{X}$ has a distribution that is $N(0, 1/n)$. Given the
significance level $\alpha$, the number $c_1$ is computed in R as $c_1$ = qnorm(1 - $\alpha$, 0, 1/$\sqrt{n}$);
hence, $P_{H_0}(\bar{X} \ge c_1) = \alpha$. So, if the experimental values of $X_1, X_2, \ldots, X_n$ were,
respectively, $x_1, x_2, \ldots, x_n$, we would compute $\bar{x} = \sum_{1}^{n} x_i / n$. If $\bar{x} \ge c_1$, the simple
hypothesis $H_0: \theta = \theta' = 0$ would be rejected at the significance level $\alpha$; if $\bar{x} < c_1$,
the hypothesis $H_0$ would be accepted. The probability of rejecting $H_0$ when $H_0$ is
true is $\alpha$, the level of significance. The probability of rejecting $H_0$, when $H_0$ is false,
is the value of the power of the test at $\theta = \theta'' = 1$, which is

$$P_{H_1}(\bar{X} \ge c_1) = \int_{c_1}^{\infty} \frac{1}{\sqrt{2\pi}\sqrt{1/n}} \exp\left[ -\frac{(\bar{x} - 1)^2}{2(1/n)} \right] d\bar{x}. \tag{8.1.7}$$

For example, if $n = 25$ and $\alpha$ is 0.05, $c_1$ = qnorm(0.95,0,1/5) = 0.329, using
R. Hence, the power of the test to detect $\theta = 1$, given in expression (8.1.7), is
computed by 1 - pnorm(0.329,1,1/5) = 0.9996.
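
The two R calls in Example 8.1.2 can be mirrored with Python's standard library, where `statistics.NormalDist(mu, sigma).inv_cdf` and `.cdf` play the roles of `qnorm` and `pnorm`. The sketch below is illustrative only; the function name `np_normal_test` is a made-up helper, not something defined in the text, and it assumes unit variance as in the example.

```python
from math import sqrt
from statistics import NormalDist

def np_normal_test(n, alpha, theta0=0.0, theta1=1.0):
    """Cutoff c1 and power for the test that rejects H0 when the sample mean is >= c1."""
    sd = 1 / sqrt(n)                                   # sd of the sample mean, unit variance assumed
    c1 = NormalDist(theta0, sd).inv_cdf(1 - alpha)     # analogue of qnorm(1 - alpha, theta0, sd)
    power = 1 - NormalDist(theta1, sd).cdf(c1)         # analogue of 1 - pnorm(c1, theta1, sd)
    return c1, power

c1, power = np_normal_test(n=25, alpha=0.05)
print(round(c1, 3), round(power, 4))
```

Changing `n` or `alpha` in the call shows how the cutoff and the power at $\theta'' = 1$ move together.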


There is another aspect of this theorem that warrants special mention. It has to
do with the number of parameters that appear in the pdf. Our notation suggests
that there is but one parameter. However, a careful review of the proof reveals
that nowhere was this needed or assumed. The pdf or pmf may depend upon
any finite number of parameters. What is essential is that the hypothesis $H_0$ and
the alternative hypothesis $H_1$ be simple, namely, that they completely specify the
distributions. With this in mind, we see that the simple hypotheses $H_0$ and $H_1$ do
not need to be hypotheses about the parameters of a distribution, nor, as a matter
of fact, do the random variables $X_1, X_2, \ldots, X_n$ need to be independent. That is, if
$H_0$ is the simple hypothesis that the joint pdf or pmf is $g(x_1, x_2, \ldots, x_n)$, and if $H_1$
is the alternative simple hypothesis that the joint pdf or pmf is $h(x_1, x_2, \ldots, x_n)$,
then $C$ is a best critical region of size $\alpha$ for testing $H_0$ against $H_1$ if, for $k > 0$,

1. $\dfrac{g(x_1, x_2, \ldots, x_n)}{h(x_1, x_2, \ldots, x_n)} \le k$ for $(x_1, x_2, \ldots, x_n) \in C$.
2. $\dfrac{g(x_1, x_2, \ldots, x_n)}{h(x_1, x_2, \ldots, x_n)} \ge k$ for $(x_1, x_2, \ldots, x_n) \in C^c$.
3. $\alpha = P_{H_0}[(X_1, X_2, \ldots, X_n) \in C]$.

Consider the following example.


Example 8.1.3. Let $X_1, \ldots, X_n$ denote a random sample on $X$ that has pmf $f(x)$
with support $\{0, 1, 2, \ldots\}$. It is desired to test the simple hypothesis

$$H_0: f(x) = \begin{cases} \dfrac{e^{-1}}{x!} & x = 0, 1, 2, \ldots \\ 0 & \text{elsewhere,} \end{cases}$$

against the alternative simple hypothesis

$$H_1: f(x) = \begin{cases} \left(\frac{1}{2}\right)^{x+1} & x = 0, 1, 2, \ldots \\ 0 & \text{elsewhere.} \end{cases}$$

That is, we want to test whether $X$ has a Poisson distribution with mean $\lambda = 1$
versus $X$ has a geometric distribution with $p = 1/2$. Here

$$\frac{g(x_1, \ldots, x_n)}{h(x_1, \ldots, x_n)} = \frac{e^{-n}/(x_1! x_2! \cdots x_n!)}{\left(\frac{1}{2}\right)^n \left(\frac{1}{2}\right)^{x_1 + x_2 + \cdots + x_n}} = \frac{(2e^{-1})^n\, 2^{\sum x_i}}{\prod_{1}^{n} (x_i!)}.$$

If $k > 0$, the set of points $(x_1, x_2, \ldots, x_n)$ such that

$$\left( \sum_{1}^{n} x_i \right) \log 2 - \log\left[ \prod_{1}^{n} (x_i!) \right] \le \log k - n \log(2e^{-1}) = c$$

is a best critical region $C$. Consider the case of $k = 1$ and $n = 1$. The preceding
inequality may be written $2^{x_1}/x_1! \le e/2$. This inequality is satisfied by all points
in the set $C = \{x_1 : x_1 = 0, 3, 4, 5, \ldots\}$. Using R, the level of significance is

$$P_{H_0}(X_1 \in C) = 1 - P_{H_0}(X_1 = 1, 2) = 1 - \text{dpois}(1, 1) - \text{dpois}(2, 1) = 0.4482.$$

The power of the test to detect $H_1$ is computed as

$$P_{H_1}(X_1 \in C) = 1 - P_{H_1}(X_1 = 1, 2) = 1 - \left( \tfrac{1}{4} + \tfrac{1}{8} \right) = 0.625.$$

Note that these results are consistent with Corollary 8.1.1.
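
The case $k = 1$, $n = 1$ of Example 8.1.3 can be traced numerically. The sketch below (Python in place of R; the names `g`, `h`, `accept`, and the truncation point `N` are illustrative assumptions) finds the points where the ratio $g/h$ exceeds $k$, that is, the complement of $C$, and then computes the level under the Poisson model and the power under the geometric model.

```python
from math import exp, factorial

def g(x):
    """Poisson pmf with mean 1 (the null model)."""
    return exp(-1) / factorial(x)

def h(x):
    """Geometric pmf (1/2)^(x+1) (the alternative model)."""
    return 0.5 ** (x + 1)

k = 1.0
N = 60  # assumed truncation point; the ratio g/h is decreasing well before this

# Points outside the critical region C: those where g/h > k.
accept = [x for x in range(N) if g(x) / h(x) > k]

level = 1 - sum(g(x) for x in accept)   # P_H0(X1 in C)
power = 1 - sum(h(x) for x in accept)   # P_H1(X1 in C)
print(accept, round(level, 4), power)
```

Working with the complement keeps the sums finite, since the critical region itself contains infinitely many points.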


Remark 8.1.2. In the notation of this section, say $C$ is a critical region such that

$$\alpha = \int_C L(\theta') \quad \text{and} \quad \beta = \int_{C^c} L(\theta''),$$

where $\alpha$ and $\beta$ equal the respective probabilities of the Type I and Type II errors
associated with $C$. Let $d_1$ and $d_2$ be two given positive constants. Consider a certain
linear function of $\alpha$ and $\beta$, namely,

$$d_1 \int_C L(\theta') + d_2 \int_{C^c} L(\theta'') = d_1 \int_C L(\theta') + d_2 \left[ 1 - \int_C L(\theta'') \right]$$
$$= d_2 + \int_C [d_1 L(\theta') - d_2 L(\theta'')].$$

If we wished to minimize this expression, we would select $C$ to be the set of all
$(x_1, x_2, \ldots, x_n)$ such that

$$d_1 L(\theta') - d_2 L(\theta'') < 0$$

or, equivalently,

$$\frac{L(\theta')}{L(\theta'')} < \frac{d_2}{d_1}, \quad \text{for all } (x_1, x_2, \ldots, x_n) \in C,$$

which according to the Neyman-Pearson theorem provides a best critical region
with $k = d_2/d_1$. That is, this critical region $C$ is one that minimizes $d_1\alpha + d_2\beta$.
There could be others, including points on which $L(\theta')/L(\theta'') = d_2/d_1$, but these
would still be best critical regions according to the Neyman-Pearson theorem.
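
Remark 8.1.2 can be illustrated on the binomial setting of Example 8.1.1, where the support is small enough to search exhaustively. In this sketch (Python, exact fractions; `weighted_error`, `all_subsets`, and the choice $d_1 = d_2 = 1$ are assumptions made for illustration), the region built from the sign of $d_1 L(\theta') - d_2 L(\theta'')$ is compared with the minimizer found by trying all 64 subsets.

```python
from fractions import Fraction
from itertools import chain, combinations
from math import comb

n = 5
support = range(n + 1)
f0 = {x: Fraction(comb(n, x), 2**n) for x in support}          # binomial(5, 1/2)
f1 = {x: Fraction(comb(n, x) * 3**x, 4**n) for x in support}   # binomial(5, 3/4)

def weighted_error(region, d1, d2):
    """d1 * alpha + d2 * beta for the test that rejects H0 on `region`."""
    alpha = sum(f0[x] for x in region)
    beta = 1 - sum(f1[x] for x in region)
    return d1 * alpha + d2 * beta

def all_subsets(points):
    points = list(points)
    return chain.from_iterable(combinations(points, r) for r in range(len(points) + 1))

d1, d2 = 1, 1
# Region suggested by the remark: points where d1*L(theta') - d2*L(theta'') <= 0.
np_region = tuple(x for x in support if d1 * f0[x] - d2 * f1[x] <= 0)
# Exhaustive minimizer over every subset of the support.
brute_best = min(all_subsets(support), key=lambda c: weighted_error(c, d1, d2))
print(np_region, brute_best, weighted_error(np_region, d1, d2))
```

Other positive weights can be tried by changing `d1` and `d2`; if some point has $d_1 f_0 = d_2 f_1$ exactly, the minimizer is no longer unique, as the remark notes.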
EXERCISES


8.1.1. In Example 8.1.2 of this section, let the simple hypotheses read $H_0: \theta =
\theta' = 0$ and $H_1: \theta = \theta'' = -1$. Show that the best test of $H_0$ against $H_1$ may be
carried out by use of the statistic $\bar{X}$, and that if $n = 25$ and $\alpha = 0.05$, the power of
the test is 0.9996 when $H_1$ is true.


8.1.2. Let the random variable $X$ have the pdf $f(x;\theta) = (1/\theta)e^{-x/\theta}$, $0 < x < \infty$,
zero elsewhere. Consider the simple hypothesis $H_0: \theta = \theta' = 2$ and the alternative
hypothesis $H_1: \theta = \theta'' = 4$. Let $X_1, X_2$ denote a random sample of size 2 from this
distribution. Show that the best test of $H_0$ against $H_1$ may be carried out by use
of the statistic $X_1 + X_2$.


8.1.3. Repeat Exercise 8.1.2 when $H_1: \theta = \theta'' = 6$. Generalize this for every
$\theta'' > 2$.


8.1.4. Let $X_1, X_2, \ldots, X_{10}$ be a random sample of size 10 from a normal distribution
$N(0, \sigma^2)$. Find a best critical region of size $\alpha = 0.05$ for testing $H_0: \sigma^2 = 1$ against
$H_1: \sigma^2 = 2$. Is this a best critical region of size 0.05 for testing $H_0: \sigma^2 = 1$ against
$H_1: \sigma^2 = 4$? Against $H_1: \sigma^2 = \sigma_1^2 > 1$?


8.1.5. If $X_1, X_2, \ldots, X_n$ is a random sample from a distribution having pdf of the
form $f(x;\theta) = \theta x^{\theta-1}$, $0 < x < 1$, zero elsewhere, show that a best critical region
for testing $H_0: \theta = 1$ against $H_1: \theta = 2$ is $C = \{(x_1, x_2, \ldots, x_n) : c \le \prod_{i=1}^{n} x_i\}$.


8.1.6. Let $X_1, X_2, \ldots, X_{10}$ be a random sample from a distribution that is $N(\theta_1, \theta_2)$.
Find a best test of the simple hypothesis $H_0: \theta_1 = \theta_1' = 0$, $\theta_2 = \theta_2' = 1$ against the
alternative simple hypothesis $H_1: \theta_1 = \theta_1'' = 1$, $\theta_2 = \theta_2'' = 4$.


8.1.7. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a normal distribution
$N(\theta, 100)$. Show that $C = \{(x_1, x_2, \ldots, x_n) : c \le \bar{x} = \sum_{1}^{n} x_i/n\}$ is a best critical
region for testing $H_0: \theta = 75$ against $H_1: \theta = 78$. Find $n$ and $c$ so that

$$P_{H_0}[(X_1, X_2, \ldots, X_n) \in C] = P_{H_0}(\bar{X} \ge c) = 0.05$$
and
$$P_{H_1}[(X_1, X_2, \ldots, X_n) \in C] = P_{H_1}(\bar{X} \ge c) = 0.90,$$
approximately.

8.1.8. If $X_1, X_2, \ldots, X_n$ is a random sample from a beta distribution with parameters
$\alpha = \beta = \theta > 0$, find a best critical region for testing $H_0: \theta = 1$ against
$H_1: \theta = 2$.


8.1.9. Let $X_1, X_2, \ldots, X_n$ be iid with pmf $f(x;p) = p^x(1 - p)^{1-x}$, $x = 0, 1$, zero
elsewhere. Show that $C = \{(x_1, \ldots, x_n) : \sum_{1}^{n} x_i \le c\}$ is a best critical region for
testing $H_0: p = \frac{1}{2}$ against $H_1: p = \frac{1}{3}$. Use the Central Limit Theorem to find $n$
and $c$ so that approximately $P_{H_0}(\sum_{1}^{n} X_i \le c) = 0.10$ and $P_{H_1}(\sum_{1}^{n} X_i \le c) = 0.80$.

8.1.10. Let $X_1, X_2, \ldots, X_{10}$ denote a random sample of size 10 from a Poisson
distribution with mean $\theta$. Show that the critical region $C$ defined by $\sum_{1}^{10} x_i \ge 3$
is a best critical region for testing $H_0: \theta = 0.1$ against $H_1: \theta = 0.5$. Determine,
for this test, the significance level $\alpha$ and the power at $\theta = 0.5$. Use the R function
ppois.


8.2 Uniformly Most Powerful Tests 


This section takes up the problem of a test of a simple hypothesis $H_0$ against an
alternative composite hypothesis $H_1$. We begin with an example.


Example 8.2.1. Consider the pdf

$$f(x;\theta) = \begin{cases} \dfrac{1}{\theta} e^{-x/\theta} & 0 < x < \infty \\ 0 & \text{elsewhere} \end{cases}$$

of Exercises 8.1.2 and 8.1.3. It is desired to test the simple hypothesis $H_0: \theta = 2$
against the alternative composite hypothesis $H_1: \theta > 2$. Thus $\Omega = \{\theta : \theta \ge 2\}$.
A random sample, $X_1, X_2$, of size $n = 2$ is used, and the critical region is $C =
\{(x_1, x_2) : 9.5 \le x_1 + x_2 < \infty\}$. It was shown in the exercises cited that the
significance level of the test is approximately 0.05 and the power of the test when
$\theta = 4$ is approximately 0.31. The power function $\gamma(\theta)$ of the test for all $\theta \ge 2$ is

$$\gamma(\theta) = 1 - \int_0^{9.5} \int_0^{9.5 - x_2} \frac{1}{\theta^2} \exp\left( -\frac{x_1 + x_2}{\theta} \right) dx_1\, dx_2 = \left( \frac{\theta + 9.5}{\theta} \right) e^{-9.5/\theta}, \quad 2 \le \theta.$$

For example, $\gamma(2) = 0.05$, $\gamma(4) = 0.31$, and $\gamma(9.5) = 2/e = 0.74$. It is shown
(Exercise 8.1.3) that the set $C = \{(x_1, x_2) : 9.5 \le x_1 + x_2 < \infty\}$ is a best critical
region of size 0.05 for testing the simple hypothesis $H_0: \theta = 2$ against each simple
hypothesis in the composite hypothesis $H_1: \theta > 2$.

The preceding example affords an illustration of a test of a simple hypothesis
$H_0$ that is a best test of $H_0$ against every simple hypothesis in the alternative
composite hypothesis $H_1$. We now define a critical region, when it exists, which
is a best critical region for testing a simple hypothesis $H_0$ against an alternative
composite hypothesis $H_1$. It seems desirable that this critical region should be a
best critical region for testing $H_0$ against each simple hypothesis in $H_1$. That is,
the power function of the test that corresponds to this critical region should be at
least as great as the power function of any other test with the same significance
level for every simple hypothesis in $H_1$.


Definition 8.2.1. The critical region $C$ is a uniformly most powerful (UMP)
critical region of size $\alpha$ for testing the simple hypothesis $H_0$ against an alternative
composite hypothesis $H_1$ if the set $C$ is a best critical region of size $\alpha$ for testing
$H_0$ against each simple hypothesis in $H_1$. A test defined by this critical region $C$
is called a uniformly most powerful (UMP) test, with significance level $\alpha$, for
testing the simple hypothesis $H_0$ against the alternative composite hypothesis $H_1$.

As will be seen presently, uniformly most powerful tests do not always exist.
However, when they do exist, the Neyman-Pearson theorem provides a technique
for finding them. Some illustrative examples are given here.


Example 8.2.2. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution
that is $N(0, \theta)$, where the variance $\theta$ is an unknown positive number. It will be
shown that there exists a uniformly most powerful test with significance level $\alpha$
for testing the simple hypothesis $H_0: \theta = \theta'$, where $\theta'$ is a fixed positive number,
against the alternative composite hypothesis $H_1: \theta > \theta'$. Thus $\Omega = \{\theta : \theta \ge \theta'\}$.
The joint pdf of $X_1, X_2, \ldots, X_n$ is

$$L(\theta; x_1, x_2, \ldots, x_n) = \left( \frac{1}{2\pi\theta} \right)^{n/2} \exp\left( -\frac{1}{2\theta} \sum_{i=1}^{n} x_i^2 \right).$$

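Let $\theta''$ represent a number greater than $\theta'$, and let $k$ denote a positive number. Let
$C$ be the set of points where

$$\frac{L(\theta'; x_1, x_2, \ldots, x_n)}{L(\theta''; x_1, x_2, \ldots, x_n)} \le k,$$

that is, the set of points where

$$\left( \frac{\theta''}{\theta'} \right)^{n/2} \exp\left[ -\left( \frac{\theta'' - \theta'}{2\theta'\theta''} \right) \sum_{1}^{n} x_i^2 \right] \le k$$

or, equivalently,

$$\sum_{1}^{n} x_i^2 \ge \frac{2\theta'\theta''}{\theta'' - \theta'} \left[ \frac{n}{2} \log\left( \frac{\theta''}{\theta'} \right) - \log k \right] = c.$$

The set $C = \{(x_1, x_2, \ldots, x_n) : \sum_{1}^{n} x_i^2 \ge c\}$ is then a best critical region for testing
the simple hypothesis $H_0: \theta = \theta'$ against the simple hypothesis $\theta = \theta''$. It remains
to determine $c$, so that this critical region has the desired size $\alpha$. If $H_0$ is true, the
random variable $\sum_{1}^{n} X_i^2/\theta'$ has a chi-square distribution with $n$ degrees of freedom.
Since $\alpha = P_{\theta'}(\sum_{1}^{n} X_i^2/\theta' \ge c/\theta')$, $c/\theta'$ may be computed, for example, by the
R code qchisq(1 - $\alpha$, n). Then $C = \{(x_1, x_2, \ldots, x_n) : \sum_{1}^{n} x_i^2 \ge c\}$ is a best
critical region of size $\alpha$ for testing $H_0: \theta = \theta'$ against the hypothesis $\theta = \theta''$.
Moreover, for each number $\theta''$ greater than $\theta'$, the foregoing argument holds. That
is, $C = \{(x_1, \ldots, x_n) : \sum_{1}^{n} x_i^2 \ge c\}$ is a uniformly most powerful critical region
of size $\alpha$ for testing $H_0: \theta = \theta'$ against $H_1: \theta > \theta'$. If $x_1, x_2, \ldots, x_n$ denote
the experimental values of $X_1, X_2, \ldots, X_n$, then $H_0: \theta = \theta'$ is rejected at the
significance level $\alpha$, and $H_1: \theta > \theta'$ is accepted if $\sum_{1}^{n} x_i^2 \ge c$; otherwise, $H_0: \theta = \theta'$
is accepted.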

If, in the preceding discussion, we take n = 15, a = 0.05, and 6’ = 3, then 
the two hypotheses are Hp : 6 = 3 and H; : 0 > 3. Using R, c/3 is computed by 
qchisq(0.95,15) = 24.996. Hence, c = 74.988. 
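
Python's standard library has no chi-square quantile function, so the following sketch substitutes a hand-rolled one for R's `qchisq`: the cdf is computed from the series for the regularized lower incomplete gamma function and then inverted by bisection. The names `chisq_cdf` and `chisq_quantile` are hypothetical helpers introduced only for this illustration; in practice a statistical library would be used instead.

```python
from math import exp, lgamma, log

def chisq_cdf(q, df):
    """P(chi-square(df) <= q) via the power series for the regularized lower incomplete gamma."""
    if q <= 0:
        return 0.0
    a, x = df / 2, q / 2
    term = total = 1 / a
    k = 0
    while term > 1e-16 * total:
        k += 1
        term *= x / (a + k)      # next series term x^k / (a (a+1) ... (a+k))
        total += term
    return total * exp(-x + a * log(x) - lgamma(a))

def chisq_quantile(p, df):
    """Invert chisq_cdf by bisection; a stand-in for R's qchisq(p, df)."""
    lo, hi = 0.0, 1.0
    while chisq_cdf(hi, df) < p:  # expand the bracket until it contains the quantile
        hi *= 2
    for _ in range(100):
        mid = (lo + hi) / 2
        if chisq_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

n, alpha, theta0 = 15, 0.05, 3
c = theta0 * chisq_quantile(1 - alpha, n)   # cutoff for the sum of squares
print(round(chisq_quantile(1 - alpha, n), 3), round(c, 3))
```

The cutoff `c` is obtained by scaling the chi-square quantile by the null variance, exactly as in the numerical illustration above; the last digit may differ slightly from the text because the text multiplies the rounded quantile.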


Example 8.2.3. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution
that is $N(\theta, 1)$, where $\theta$ is unknown. It will be shown that there is no uniformly most
powerful test of the simple hypothesis $H_0: \theta = \theta'$, where $\theta'$ is a fixed number, against
the alternative composite hypothesis $H_1: \theta \ne \theta'$. Thus $\Omega = \{\theta : -\infty < \theta < \infty\}$.
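Let $\theta''$ be a number not equal to $\theta'$. Let $k$ be a positive number and consider

$$\frac{(1/2\pi)^{n/2} \exp\left[ -\sum_{1}^{n} (x_i - \theta')^2 / 2 \right]}{(1/2\pi)^{n/2} \exp\left[ -\sum_{1}^{n} (x_i - \theta'')^2 / 2 \right]} \le k.$$

The preceding inequality may be written as

$$\exp\left\{ -(\theta'' - \theta') \sum_{1}^{n} x_i + \frac{n}{2} \left[ (\theta'')^2 - (\theta')^2 \right] \right\} \le k$$
or
$$(\theta'' - \theta') \sum_{1}^{n} x_i \ge \frac{n}{2} \left[ (\theta'')^2 - (\theta')^2 \right] - \log k.$$

This last inequality is equivalent to

$$\sum_{1}^{n} x_i \ge \frac{n}{2} (\theta'' + \theta') - \frac{\log k}{\theta'' - \theta'},$$

provided that $\theta'' > \theta'$, and it is equivalent to

$$\sum_{1}^{n} x_i \le \frac{n}{2} (\theta'' + \theta') - \frac{\log k}{\theta'' - \theta'}$$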

if $\theta'' < \theta'$. The first of these two expressions defines a best critical region for testing
$H_0: \theta = \theta'$ against the hypothesis $\theta = \theta''$ provided that $\theta'' > \theta'$, while the second
expression defines a best critical region for testing $H_0: \theta = \theta'$ against the hypothesis
$\theta = \theta''$ provided that $\theta'' < \theta'$. That is, a best critical region for testing the simple
hypothesis against an alternative simple hypothesis, say $\theta = \theta' + 1$, does not serve as
a best critical region for testing $H_0: \theta = \theta'$ against the alternative simple hypothesis
$\theta = \theta' - 1$. By definition, then, there is no uniformly most powerful test in the case
under consideration.

It should be noted that had the alternative composite hypothesis been one-sided,
either $H_1: \theta > \theta'$ or $H_1: \theta < \theta'$, a uniformly most powerful test would exist in each
instance.
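
The conclusion of Example 8.2.3 can be made concrete by comparing the two one-sided critical regions numerically. The sketch below is only an illustration under assumed values ($n = 25$, $\alpha = 0.05$, $\theta' = 0$, none of which are fixed by the example); `power_upper` and `power_lower` are made-up names for the power functions of the regions $\bar{x} \ge c$ and $\bar{x} \le c$.

```python
from math import sqrt
from statistics import NormalDist

n, alpha, theta0 = 25, 0.05, 0.0      # assumed values for illustration
sd = 1 / sqrt(n)                      # sd of the sample mean when the variance is 1
null = NormalDist(theta0, sd)
c_upper = null.inv_cdf(1 - alpha)     # best region against theta'' > theta': mean >= c_upper
c_lower = null.inv_cdf(alpha)         # best region against theta'' < theta': mean <= c_lower

def power_upper(theta):
    """Power of the upper-tail region at theta."""
    return 1 - NormalDist(theta, sd).cdf(c_upper)

def power_lower(theta):
    """Power of the lower-tail region at theta."""
    return NormalDist(theta, sd).cdf(c_lower)

for theta in (theta0 + 1, theta0 - 1):
    print(theta, power_upper(theta), power_lower(theta))
```

Each region is nearly certain to reject on its own side of $\theta'$ and has power below $\alpha$ on the other side, so neither can be best against every alternative in $\theta \ne \theta'$.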


Example 8.2.4. In Exercise 8.1.10, the reader was asked to show that if a random
sample of size $n = 10$ is taken from a Poisson distribution with mean $\theta$, the critical
region defined by $\sum_{1}^{n} x_i \ge 3$ is a best critical region for testing $H_0: \theta = 0.1$ against
$H_1: \theta = 0.5$. This critical region is also a uniformly most powerful one for testing
$H_0: \theta = 0.1$ against $H_1: \theta > 0.1$ because, with $\theta'' > 0.1$,

$$\frac{(0.1)^{\sum x_i} e^{-10(0.1)} / (x_1! x_2! \cdots x_n!)}{(\theta'')^{\sum x_i} e^{-10(\theta'')} / (x_1! x_2! \cdots x_n!)} \le k$$

is equivalent to

$$\left( \frac{0.1}{\theta''} \right)^{\sum x_i} e^{-10(0.1 - \theta'')} \le k.$$

The preceding inequality may be written as

$$\left( \sum_{1}^{n} x_i \right) (\log 0.1 - \log \theta'') \le \log k + 10(0.1 - \theta'')$$

or, since $\theta'' > 0.1$, equivalently as

$$\sum_{1}^{n} x_i \ge \frac{\log k + 1 - 10\theta''}{\log 0.1 - \log \theta''}.$$

Of course, $\sum_{1}^{n} x_i \ge 3$ is of the latter form.
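
Because the sum of ten iid Poisson variables with mean $\theta$ is Poisson with mean $10\theta$, the power function of the region $\sum x_i \ge 3$ is easy to tabulate. The sketch below uses Python in place of the R function ppois; `poisson_cdf` and `power` are illustrative helper names and the grid of $\theta$ values is an arbitrary choice.

```python
from math import exp, factorial

def poisson_cdf(y, mean):
    """P(Y <= y) for Y ~ Poisson(mean); the analogue of R's ppois(y, mean)."""
    return sum(exp(-mean) * mean**j / factorial(j) for j in range(y + 1))

def power(theta, n=10, cutoff=3):
    """P(sum of n iid Poisson(theta) >= cutoff); the sum is Poisson with mean n * theta."""
    return 1 - poisson_cdf(cutoff - 1, n * theta)

thetas = [0.1, 0.2, 0.3, 0.5, 1.0]
powers = [power(t) for t in thetas]
for t, p in zip(thetas, powers):
    print(t, round(p, 4))
```

The first entry is the significance level at $\theta = 0.1$, and the values increase with $\theta$, as expected for a one-sided region of this form.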


Let us make an important observation, although obvious when pointed out. Let
$X_1, X_2, \ldots, X_n$ denote a random sample from a distribution that has pdf $f(x;\theta)$, $\theta \in
\Omega$. Suppose that $Y = u(X_1, X_2, \ldots, X_n)$ is a sufficient statistic for $\theta$. In accordance
with the factorization theorem, the joint pdf of $X_1, X_2, \ldots, X_n$ may be written

$$L(\theta; x_1, x_2, \ldots, x_n) = k_1[u(x_1, x_2, \ldots, x_n); \theta]\, k_2(x_1, x_2, \ldots, x_n),$$

where $k_2(x_1, x_2, \ldots, x_n)$ does not depend upon $\theta$. Consequently, the ratio

$$\frac{L(\theta'; x_1, x_2, \ldots, x_n)}{L(\theta''; x_1, x_2, \ldots, x_n)} = \frac{k_1[u(x_1, x_2, \ldots, x_n); \theta']}{k_1[u(x_1, x_2, \ldots, x_n); \theta'']}$$

depends upon $x_1, x_2, \ldots, x_n$ only through $u(x_1, x_2, \ldots, x_n)$. Accordingly, if there
is a sufficient statistic $Y = u(X_1, X_2, \ldots, X_n)$ for $\theta$ and if a best test or a uniformly
most powerful test is desired, there is no need to consider tests that are based upon
any statistic other than the sufficient statistic. This result supports the importance
of sufficiency.

In the above examples, we have presented uniformly most powerful tests. For
some families of pdfs and hypotheses, we can obtain general forms of such tests.
We sketch these results for the general one-sided hypotheses of the form

$$H_0: \theta \le \theta' \ \text{ versus } \ H_1: \theta > \theta'. \tag{8.2.1}$$

The other one-sided hypotheses, with the null hypothesis $H_0: \theta \ge \theta'$, are completely
analogous. Note that the null hypothesis of (8.2.1) is a composite hypothesis. Recall
from Chapter 4 that the level of a test for the hypotheses (8.2.1) is defined by
$\max_{\theta \le \theta'} \gamma(\theta)$, where $\gamma(\theta)$ is the power function of the test. That is, the significance
level is the maximum probability of Type I error.

Let $\mathbf{X}' = (X_1, \ldots, X_n)$ be a random sample with common pdf (or pmf) $f(x;\theta)$,
$\theta \in \Omega$, and, hence with the likelihood function

$$L(\theta, \mathbf{x}) = \prod_{i=1}^{n} f(x_i;\theta), \quad \mathbf{x}' = (x_1, \ldots, x_n).$$
i=1 


We consider the family of pdfs that has monotone likelihood ratio as defined next.


Definition 8.2.2. We say that the likelihood $L(\theta, \mathbf{x})$ has monotone likelihood
ratio (mlr) in the statistic $y = u(\mathbf{x})$ if, for $\theta_1 < \theta_2$, the ratio

$$\frac{L(\theta_1, \mathbf{x})}{L(\theta_2, \mathbf{x})} \tag{8.2.2}$$

is a monotone function of $y = u(\mathbf{x})$.


Assume then that our likelihood function $L(\theta, \mathbf{x})$ has a monotone decreasing
likelihood ratio in the statistic $y = u(\mathbf{x})$. Then the ratio in (8.2.2) is equal to $g(y)$,
where $g$ is a decreasing function. The case where the likelihood function has a monotone
increasing likelihood ratio (i.e., $g$ is an increasing function) follows similarly
by changing the sense of the inequalities below. Let $\alpha$ denote the significance level.
Then we claim that the following test is UMP level $\alpha$ for the hypotheses (8.2.1):

$$\text{Reject } H_0 \text{ if } Y \ge c_Y, \tag{8.2.3}$$

where $c_Y$ is determined by $\alpha = P_{\theta'}[Y \ge c_Y]$. To show this claim, first consider the
simple null hypothesis $H_0': \theta = \theta'$. Let $\theta'' > \theta'$ be arbitrary but fixed. Let $C$
denote the most powerful critical region for $\theta'$ versus $\theta''$. By the Neyman-Pearson
Theorem, $C$ is defined by

$$\frac{L(\theta', \mathbf{X})}{L(\theta'', \mathbf{X})} \le k \ \text{ if and only if } \ \mathbf{X} \in C,$$

where $k$ is determined by $\alpha = P_{\theta'}[\mathbf{X} \in C]$. But by Definition 8.2.2, because $\theta'' > \theta'$,

$$\frac{L(\theta', \mathbf{X})}{L(\theta'', \mathbf{X})} = g(Y) \le k \ \Leftrightarrow \ Y \ge g^{-1}(k),$$

where $g^{-1}(k)$ satisfies $\alpha = P_{\theta'}[Y \ge g^{-1}(k)]$; i.e., $c_Y = g^{-1}(k)$. Hence the Neyman-Pearson
test is equivalent to the test defined by (8.2.3). Furthermore, the test is
UMP for $\theta'$ versus $\theta'' > \theta'$ because the test only depends on $\theta'' > \theta'$ and $g^{-1}(k)$ is
uniquely determined under $\theta'$.

Let yy (@) denote the power function of the test (8.2.3). To finish, we need to 
show that maxg<o yy (@) = a. But this follows immediately if we can show that 
yy (@) is a nondecreasing function. To see this, let 6; < 62. Note that since 6; < 62, 
the test (8.2.3) is the most powerful test for testing 6; versus #2 with the level 
yy (41). By Corollary 8.1.1, the power of the test at 02 must not be below the level; 
ie., yy (02) > yy (01). Hence yy (0) is a nondecreasing function. Since the power 
function is nondecreasing, it follows from Definition 8.1.2 that the mlr tests are 
unbiased tests for the hypotheses (8.2.1); see Exercise 8.2.14.

Example 8.2.5. Let $X_1, X_2, \ldots, X_n$ be a random sample from a Bernoulli distribution
with parameter $p = \theta$, where $0 < \theta < 1$. Let $\theta' < \theta''$. Consider the ratio of
likelihoods

$$\frac{L(\theta'; x_1, x_2, \ldots, x_n)}{L(\theta''; x_1, x_2, \ldots, x_n)} = \frac{(\theta')^{\sum x_i}(1-\theta')^{n-\sum x_i}}{(\theta'')^{\sum x_i}(1-\theta'')^{n-\sum x_i}} = \left[\frac{\theta'(1-\theta'')}{\theta''(1-\theta')}\right]^{\sum x_i}\left(\frac{1-\theta'}{1-\theta''}\right)^{n}.$$

Since $\theta'/\theta'' < 1$ and $(1-\theta'')/(1-\theta') < 1$, so that $\theta'(1-\theta'')/[\theta''(1-\theta')] < 1$, the
ratio is a decreasing function of $y = \sum x_i$. Thus we have a monotone likelihood
ratio in the statistic $Y = \sum X_i$.
Consider the hypotheses

$$H_0 : \theta \le \theta' \ \text{ versus } \ H_1 : \theta > \theta'. \qquad (8.2.4)$$

By our discussion above, the UMP level $\alpha$ decision rule for testing $H_0$ versus $H_1$ is
given by

$$\text{Reject } H_0 \text{ if } Y = \sum_{i=1}^{n} X_i \ge c,$$

where $c$ is such that $\alpha = P_{\theta'}[Y \ge c]$. ■
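As a numerical companion to Example 8.2.5, the following R sketch finds the critical value $c$ for a Bernoulli sample. The values $n = 20$, $\theta' = 0.5$, $\alpha = 0.05$ and all object names are illustrative assumptions, not taken from the text. Because $Y$ is discrete, the nominal $\alpha$ usually cannot be attained exactly without randomization, so the sketch takes the smallest $c$ whose size does not exceed $\alpha$.

```r
## Illustrative settings (assumptions, not from the text)
n <- 20; theta0 <- 0.5; alpha <- 0.05

## Under theta0, Y = sum(X_i) is binomial(n, theta0).
## qbinom gives the smallest q with P(Y <= q) >= 1 - alpha,
## so c = q + 1 is the smallest c with P(Y >= c) <= alpha.
c_crit <- qbinom(1 - alpha, n, theta0) + 1

## Actual size of the nonrandomized test "reject if Y >= c_crit"
size <- pbinom(c_crit - 1, n, theta0, lower.tail = FALSE)

## Power function gamma(theta) = P_theta(Y >= c_crit)
power_fn <- function(theta) pbinom(c_crit - 1, n, theta, lower.tail = FALSE)

c_crit; size; power_fn(c(0.5, 0.6, 0.7, 0.8))
```

The power values increase with $\theta$, in line with the nondecreasing power function established for mlr tests.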


In the last example concerning a Bernoulli pmf, we obtained a UMP test by 
showing that its likelihood possesses mlr. The Bernoulli distribution is a regular 
case of the exponential family and our argument, under the one assumption below, 
can be generalized to the entire regular exponential family. To show this, suppose 
that the random sample X,, X2,...,X, arises from a pdf or pmf representing a 
regular case of the exponential class, namely, 


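$$f(x;\theta) = \begin{cases} \exp[p(\theta)K(x) + H(x) + q(\theta)] & x \in \mathcal{S} \\ 0 & \text{elsewhere,} \end{cases}$$

where the support of $X$, $\mathcal{S}$, is free of $\theta$. Further assume that $p(\theta)$ is an increasing
function of $\theta$. Then

$$\frac{L(\theta')}{L(\theta'')} = \frac{\exp\left[p(\theta')\sum_{1}^{n}K(x_i) + \sum_{1}^{n}H(x_i) + nq(\theta')\right]}{\exp\left[p(\theta'')\sum_{1}^{n}K(x_i) + \sum_{1}^{n}H(x_i) + nq(\theta'')\right]} = \exp\left\{[p(\theta') - p(\theta'')]\sum_{1}^{n}K(x_i) + n[q(\theta') - q(\theta'')]\right\}.$$

If $\theta' < \theta''$, then $p(\theta)$ being an increasing function requires this ratio to be a decreasing
function of $y = \sum_{1}^{n} K(x_i)$. Thus, we have a monotone likelihood ratio in the statistic
$Y = \sum_{1}^{n} K(X_i)$. Hence consider the hypotheses
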
$$H_0 : \theta \le \theta' \ \text{ versus } \ H_1 : \theta > \theta'. \qquad (8.2.5)$$

By our discussion above concerning mlr, the UMP level $\alpha$ decision rule for testing
$H_0$ versus $H_1$ is given by

$$\text{Reject } H_0 \text{ if } Y = \sum_{i=1}^{n} K(X_i) \ge c,$$

where $c$ is such that $\alpha = P_{\theta'}[Y \ge c]$. Furthermore, the power function of this test
is an increasing function in $\theta$.
For the record, consider the other one-sided alternative hypotheses,

$$H_0 : \theta \ge \theta' \ \text{ versus } \ H_1 : \theta < \theta'. \qquad (8.2.6)$$

The UMP level $\alpha$ decision rule is, for $p(\theta)$ an increasing function,

$$\text{Reject } H_0 \text{ if } Y = \sum_{i=1}^{n} K(X_i) \le c,$$

where $c$ is such that $\alpha = P_{\theta'}[Y \le c]$.

If in the preceding situation with monotone likelihood ratio we test $H_0 : \theta = \theta'$
against $H_1 : \theta > \theta'$, then $\sum K(x_i) \ge c$ would be a uniformly most powerful
critical region. From the likelihood ratios displayed in Examples 8.2.2-8.2.5, we see
immediately that the respective critical regions


are uniformly most powerful for testing $H_0 : \theta = \theta'$ against $H_1 : \theta > \theta'$.

There is a final remark that should be made about uniformly most powerful
tests. Of course, in Definition 8.2.1, the word uniformly is associated with $\theta$; that
is, $C$ is a best critical region of size $\alpha$ for testing $H_0 : \theta = \theta_0$ against all $\theta$ values
given by the composite alternative $H_1$. However, suppose that the form of such a
region is

$$u(x_1, x_2, \ldots, x_n) \le c.$$

Then this form provides uniformly most powerful critical regions for all attainable $\alpha$
values by, of course, appropriately changing the value of $c$. That is, there is a certain
uniformity property, also associated with $\alpha$, that is not always noted in statistics
texts.


EXERCISES 


8.2.1. Let $X$ have the pmf $f(x;\theta) = \theta^x(1-\theta)^{1-x}$, $x = 0, 1$, zero elsewhere. We
test the simple hypothesis $H_0 : \theta = \frac{1}{4}$ against the alternative composite hypothesis
$H_1 : \theta < \frac{1}{4}$ by taking a random sample of size 10 and rejecting $H_0 : \theta = \frac{1}{4}$ if and
only if the observed values $x_1, x_2, \ldots, x_{10}$ of the sample observations are such that
$\sum_{1}^{10} x_i \le 1$. Find the power function $\gamma(\theta)$, $0 < \theta \le \frac{1}{4}$, of this test.


8.2.2. Let $X$ have a pdf of the form $f(x;\theta) = 1/\theta$, $0 < x < \theta$, zero elsewhere. Let
$Y_1 < Y_2 < Y_3 < Y_4$ denote the order statistics of a random sample of size 4 from
this distribution. Let the observed value of $Y_4$ be $y_4$. We reject $H_0 : \theta = 1$ and
accept $H_1 : \theta \ne 1$ if either $y_4 \le \frac{1}{2}$ or $y_4 > 1$. Find the power function $\gamma(\theta)$, $0 < \theta$,
of the test.


8.2.3. Consider a normal distribution of the form $N(\theta, 4)$. The simple hypothesis
$H_0 : \theta = 0$ is rejected, and the alternative composite hypothesis $H_1 : \theta > 0$ is
accepted if and only if the observed mean $\bar{x}$ of a random sample of size 25 is greater
than or equal to $\frac{3}{5}$. Find the power function $\gamma(\theta)$, $0 \le \theta$, of this test.


8.2.4. Consider the distributions $N(\mu_1, 400)$ and $N(\mu_2, 225)$. Let $\theta = \mu_1 - \mu_2$. Let
$\bar{x}$ and $\bar{y}$ denote the observed means of two independent random samples, each of
size $n$, from these two distributions. We reject $H_0 : \theta = 0$ and accept $H_1 : \theta > 0$ if
and only if $\bar{x} - \bar{y} \ge c$. If $\gamma(\theta)$ is the power function of this test, find $n$ and $c$ so that
$\gamma(0) = 0.05$ and $\gamma(10) = 0.90$, approximately.


8.2.5. Consider Example 8.2.2. Show that $L(\theta)$ has a monotone likelihood ratio in
the statistic $\sum_{i=1}^{n} X_i^2$. Use this to determine the UMP test for $H_0 : \theta = \theta'$, where
$\theta'$ is a fixed positive number, versus $H_1 : \theta < \theta'$.


8.2.6. If, in Example 8.2.2 of this section, $H_0 : \theta = \theta'$, where $\theta'$ is a fixed positive
number, and $H_1 : \theta \ne \theta'$, show that there is no uniformly most powerful test for
testing $H_0$ against $H_1$.


8.2.7. Let $X_1, X_2, \ldots, X_{25}$ denote a random sample of size 25 from a normal
distribution $N(\theta, 100)$. Find a uniformly most powerful critical region of size $\alpha = 0.10$
for testing $H_0 : \theta = 75$ against $H_1 : \theta > 75$.


8.2.8. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a normal distribution
$N(\theta, 16)$. Find the sample size $n$ and a uniformly most powerful test of $H_0 : \theta = 25$
against $H_1 : \theta < 25$ with power function $\gamma(\theta)$ so that approximately $\gamma(25) = 0.10$
and $\gamma(23) = 0.90$.


8.2.9. Consider a distribution having a pmf of the form $f(x;\theta) = \theta^x(1-\theta)^{1-x}$, $x =
0, 1$, zero elsewhere. Let $H_0 : \theta = \frac{1}{20}$ and $H_1 : \theta > \frac{1}{20}$. Use the Central Limit
Theorem to determine the sample size $n$ of a random sample so that a uniformly
most powerful test of $H_0$ against $H_1$ has a power function $\gamma(\theta)$, with approximately
$\gamma(\frac{1}{20}) = 0.05$ and $\gamma(\frac{1}{10}) = 0.90$.


8.2.10. Illustrative Example 8.2.1 of this section dealt with a random sample of
size $n = 2$ from a gamma distribution with $\alpha = 1$, $\beta = \theta$. Thus the mgf of the
distribution is $(1 - \theta t)^{-1}$, $t < 1/\theta$, $\theta \ge 2$. Let $Z = X_1 + X_2$. Show that $Z$ has
a gamma distribution with $\alpha = 2$, $\beta = \theta$. Express the power function $\gamma(\theta)$ of
Example 8.2.1 in terms of a single integral. Generalize this for a random sample of
size $n$.


8.2.11. Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution with pdf
$f(x;\theta) = \theta x^{\theta-1}$, $0 < x < 1$, zero elsewhere, where $\theta > 0$. Show the likelihood has
mlr in the statistic $\prod_{i=1}^{n} X_i$. Use this to determine the UMP test for $H_0 : \theta = \theta'$
against $H_1 : \theta < \theta'$, for fixed $\theta' > 0$.


8.2.12. Let $X$ have the pdf $f(x;\theta) = \theta^x(1-\theta)^{1-x}$, $x = 0, 1$, zero elsewhere. We
test $H_0 : \theta = \frac{1}{2}$ against $H_1 : \theta < \frac{1}{2}$ by taking a random sample $X_1, X_2, \ldots, X_5$ of
size $n = 5$ and rejecting $H_0$ if $Y = \sum_{1}^{5} X_i$ is observed to be less than or equal to a
constant $c$.


(a) Show that this is a uniformly most powerful test. 
(b) Find the significance level when c = 1. 
(c) Find the significance level when c = 0. 


(d) By using a randomized test, as discussed in Example 4.6.4, modify the tests
given in parts (b) and (c) to find a test with significance level $\alpha = \frac{2}{32}$.


8.2.13. Let $X_1, \ldots, X_n$ denote a random sample from a gamma-type distribution
with $\alpha = 2$ and $\beta = \theta$. Let $H_0 : \theta = 1$ and $H_1 : \theta > 1$.


(a) Show that there exists a uniformly most powerful test for $H_0$ against $H_1$,
determine the statistic $Y$ upon which the test may be based, and indicate the
nature of the best critical region.


(b) Find the pdf of the statistic $Y$ in part (a). If we want a significance level of
0.05, write an equation that can be used to determine the critical region. Let
$\gamma(\theta)$, $\theta \ge 1$, be the power function of the test. Express the power function as
an integral.


8.2.14. Show that the mlr test defined by expression (8.2.3) is an unbiased test for 
the hypotheses (8.2.1). 


8.3 Likelihood Ratio Tests 


In the first section of this chapter, we presented the most powerful tests for sim- 
ple versus simple hypotheses. In the second section, we extended this theory to 
uniformly most powerful tests for essentially one-sided alternative hypotheses and 
families of distributions that have a monotone likelihood ratio. What about the 
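general case? That is, suppose the random variable $X$ has pdf or pmf $f(x;\boldsymbol{\theta})$,
where $\boldsymbol{\theta}$ is a vector of parameters in $\Omega$. Let $\omega \subset \Omega$ and consider the hypotheses

$$H_0 : \boldsymbol{\theta} \in \omega \ \text{ versus } \ H_1 : \boldsymbol{\theta} \in \Omega \cap \omega^c. \qquad (8.3.1)$$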


There are complications in extending the optimal theory to this general situation, 
which are addressed in more advanced books; see, in particular, Lehmann (1986). 
We illustrate some of these complications with an example. Suppose $X$ has a
$N(\theta_1, \theta_2)$ distribution and that we want to test $\theta_1 = \theta_1'$, where $\theta_1'$ is specified. In
the notation of (8.3.1), $\boldsymbol{\theta} = (\theta_1, \theta_2)$, $\Omega = \{\boldsymbol{\theta} : -\infty < \theta_1 < \infty, \theta_2 > 0\}$, and
$\omega = \{\boldsymbol{\theta} : \theta_1 = \theta_1', \theta_2 > 0\}$. Notice that $H_0 : \boldsymbol{\theta} \in \omega$ is a composite null hypothesis.
Let $X_1, \ldots, X_n$ be a random sample on $X$.

Assume for the moment that $\theta_2$ is known. Then $H_0$ becomes the simple hypothesis
$\theta_1 = \theta_1'$. This is essentially the situation discussed in Example 8.2.3. There
it was shown that no UMP test exists for this situation. If we restrict attention
to the class of unbiased tests (Definition 8.1.2), then a theory of best tests can be
constructed; see Lehmann (1986). For our illustrative example, as Exercise 8.3.21
shows, the test based on the critical region

$$C_2 = \left\{ |\bar{X} - \theta_1'| > \sqrt{\frac{\theta_2}{n}}\, z_{\alpha/2} \right\}$$

is unbiased. Then it follows from Lehmann that it is a UMP unbiased level $\alpha$ test.

In practice, though, the variance $\theta_2$ is unknown. In this case, theory for optimal
tests can be constructed using the concept of what are called conditional tests. 
We do not pursue this any further in this text, but refer the interested reader to 
Lehmann (1986). 

Recall from Chapter 6 that the likelihood ratio tests (6.3.3) can be used to test
general hypotheses such as (8.3.1). While in general the exact null distribution of the
test statistic cannot be determined, under regularity conditions the likelihood ratio
test statistic is asymptotically $\chi^2$ under $H_0$. Hence we can obtain an approximate
test in most situations. Although there is no guarantee that likelihood ratio tests
are optimal, they are, like tests based on the Neyman-Pearson Theorem,
based on a ratio of likelihood functions and, in many situations, are asymptotically
optimal.

In the example above on testing for the mean of a normal distribution, with 
known variance, the likelihood ratio test is the same as the UMP unbiased test. 
When the variance is unknown, the likelihood ratio test results in the one-sample 
t-test as shown in Example 6.5.1 of Chapter 6. This is the same as the conditional 
test discussed in Lehmann (1986). 

In the remainder of this section, we present likelihood ratio tests for situations 
when sampling from normal distributions. 


8.3.1 Likelihood Ratio Tests for Testing Means of Normal 
Distributions 


In Example 6.5.1 of Chapter 6, we derived the likelihood ratio test for the one- 
sample t-test to test for the mean of a normal distribution with unknown variance. 
In the next example, we derive the likelihood ratio test for comparing the means
of two independent normal distributions. We then discuss the power functions for 
both of these tests. 


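Example 8.3.1. Let the independent random variables $X$ and $Y$ have distributions
that are $N(\theta_1, \theta_3)$ and $N(\theta_2, \theta_3)$, where the means $\theta_1$ and $\theta_2$ and common variance $\theta_3$
are unknown. Then $\Omega = \{(\theta_1, \theta_2, \theta_3) : -\infty < \theta_1 < \infty, -\infty < \theta_2 < \infty, 0 < \theta_3 < \infty\}$.
Let $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ denote independent random samples from
these distributions. The hypothesis $H_0 : \theta_1 = \theta_2$, unspecified, and $\theta_3$ unspecified,
is to be tested against all alternatives. Then $\omega = \{(\theta_1, \theta_2, \theta_3) : -\infty < \theta_1 = \theta_2 <
\infty, 0 < \theta_3 < \infty\}$. Here $X_1, X_2, \ldots, X_n, Y_1, Y_2, \ldots, Y_m$ are $n + m > 2$ mutually
independent random variables having the likelihood functions

$$L(\omega) = \left(\frac{1}{2\pi\theta_3}\right)^{(n+m)/2} \exp\left\{-\frac{1}{2\theta_3}\left[\sum_{1}^{n}(x_i - \theta_1)^2 + \sum_{1}^{m}(y_i - \theta_1)^2\right]\right\}$$

and

$$L(\Omega) = \left(\frac{1}{2\pi\theta_3}\right)^{(n+m)/2} \exp\left\{-\frac{1}{2\theta_3}\left[\sum_{1}^{n}(x_i - \theta_1)^2 + \sum_{1}^{m}(y_i - \theta_2)^2\right]\right\}.$$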
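If $\partial \log L(\omega)/\partial\theta_1$ and $\partial \log L(\omega)/\partial\theta_3$ are equated to zero, then (Exercise 8.3.2)

$$\sum_{1}^{n}(x_i - \theta_1) + \sum_{1}^{m}(y_i - \theta_1) = 0$$
$$\frac{1}{\theta_3}\left[\sum_{1}^{n}(x_i - \theta_1)^2 + \sum_{1}^{m}(y_i - \theta_1)^2\right] = n + m. \qquad (8.3.2)$$

The solutions for $\theta_1$ and $\theta_3$ are, respectively,

$$u = (n+m)^{-1}\left\{\sum_{1}^{n} x_i + \sum_{1}^{m} y_i\right\}$$
$$w = (n+m)^{-1}\left\{\sum_{1}^{n}(x_i - u)^2 + \sum_{1}^{m}(y_i - u)^2\right\}.$$

Further, $u$ and $w$ maximize $L(\omega)$. The maximum is

$$L(\hat{\omega}) = \left(\frac{e^{-1}}{2\pi w}\right)^{(n+m)/2}.$$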


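In a like manner, if

$$\frac{\partial \log L(\Omega)}{\partial\theta_1}, \quad \frac{\partial \log L(\Omega)}{\partial\theta_2}, \quad \frac{\partial \log L(\Omega)}{\partial\theta_3}$$

are equated to zero, then (Exercise 8.3.3)

$$\sum_{1}^{n}(x_i - \theta_1) = 0$$
$$\sum_{1}^{m}(y_i - \theta_2) = 0 \qquad (8.3.3)$$
$$-(n+m) + \frac{1}{\theta_3}\left[\sum_{1}^{n}(x_i - \theta_1)^2 + \sum_{1}^{m}(y_i - \theta_2)^2\right] = 0.$$

The solutions for $\theta_1$, $\theta_2$, and $\theta_3$ are, respectively,

$$u_1 = n^{-1}\sum_{1}^{n} x_i$$
$$u_2 = m^{-1}\sum_{1}^{m} y_i$$
$$w' = (n+m)^{-1}\left[\sum_{1}^{n}(x_i - u_1)^2 + \sum_{1}^{m}(y_i - u_2)^2\right],$$

and, further, $u_1$, $u_2$, and $w'$ maximize $L(\Omega)$. The maximum is

$$L(\hat{\Omega}) = \left(\frac{e^{-1}}{2\pi w'}\right)^{(n+m)/2},$$

so that

$$\Lambda(x_1, \ldots, x_n, y_1, \ldots, y_m) = \Lambda = \frac{L(\hat{\omega})}{L(\hat{\Omega})} = \left(\frac{w'}{w}\right)^{(n+m)/2}.$$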


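The random variable defined by $\Lambda^{2/(n+m)}$ is

$$\frac{\sum_{1}^{n}(X_i - \bar{X})^2 + \sum_{1}^{m}(Y_i - \bar{Y})^2}{\sum_{1}^{n}\{X_i - [(n\bar{X} + m\bar{Y})/(n+m)]\}^2 + \sum_{1}^{m}\{Y_i - [(n\bar{X} + m\bar{Y})/(n+m)]\}^2}.$$

Now

$$\sum_{1}^{n}\left(X_i - \frac{n\bar{X} + m\bar{Y}}{n+m}\right)^2 = \sum_{1}^{n}\left[(X_i - \bar{X}) + \left(\bar{X} - \frac{n\bar{X} + m\bar{Y}}{n+m}\right)\right]^2 = \sum_{1}^{n}(X_i - \bar{X})^2 + n\left(\bar{X} - \frac{n\bar{X} + m\bar{Y}}{n+m}\right)^2$$

and

$$\sum_{1}^{m}\left(Y_i - \frac{n\bar{X} + m\bar{Y}}{n+m}\right)^2 = \sum_{1}^{m}\left[(Y_i - \bar{Y}) + \left(\bar{Y} - \frac{n\bar{X} + m\bar{Y}}{n+m}\right)\right]^2 = \sum_{1}^{m}(Y_i - \bar{Y})^2 + m\left(\bar{Y} - \frac{n\bar{X} + m\bar{Y}}{n+m}\right)^2.$$

But

$$n\left(\bar{X} - \frac{n\bar{X} + m\bar{Y}}{n+m}\right)^2 = \frac{m^2 n}{(n+m)^2}(\bar{X} - \bar{Y})^2$$

and

$$m\left(\bar{Y} - \frac{n\bar{X} + m\bar{Y}}{n+m}\right)^2 = \frac{n^2 m}{(n+m)^2}(\bar{X} - \bar{Y})^2.$$

Hence the random variable defined by $\Lambda^{2/(n+m)}$ may be written

$$\frac{\sum_{1}^{n}(X_i - \bar{X})^2 + \sum_{1}^{m}(Y_i - \bar{Y})^2}{\sum_{1}^{n}(X_i - \bar{X})^2 + \sum_{1}^{m}(Y_i - \bar{Y})^2 + [nm/(n+m)](\bar{X} - \bar{Y})^2}.$$

If the hypothesis $H_0 : \theta_1 = \theta_2$ is true, the random variable

$$T = \frac{\sqrt{\dfrac{nm}{n+m}}\,(\bar{X} - \bar{Y})}{\left\{(n+m-2)^{-1}\left[\sum_{1}^{n}(X_i - \bar{X})^2 + \sum_{1}^{m}(Y_i - \bar{Y})^2\right]\right\}^{1/2}} \qquad (8.3.4)$$

has, in accordance with Section 3.6, a $t$-distribution with $n + m - 2$ degrees of
freedom. Thus the random variable defined by $\Lambda^{2/(n+m)}$ is

$$\frac{n+m-2}{(n+m-2) + T^2}.$$

The test of $H_0$ against all alternatives may then be based on a $t$-distribution with
$n + m - 2$ degrees of freedom.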

The likelihood ratio principle calls for the rejection of $H_0$ if and only if $\Lambda \le \lambda_0 <
1$. Thus the significance level of the test is

$$\alpha = P_{H_0}[\Lambda(X_1, \ldots, X_n, Y_1, \ldots, Y_m) \le \lambda_0].$$

However, $\Lambda(X_1, \ldots, X_n, Y_1, \ldots, Y_m) \le \lambda_0$ is equivalent to $|T| \ge c$, and so

$$\alpha = P(|T| \ge c; H_0).$$

For given values of $n$ and $m$, the number $c$ is easily computed. In R, c = qt(1 -
alpha/2, n + m - 2). Then $H_0$ is rejected at a significance level $\alpha$ if and only if $|t| \ge c$,
where $t$ is the observed value of $T$. If, for instance, $n = 10$, $m = 6$, and $\alpha = 0.05$,
then c = qt(0.975, 14) = 2.1448. ■
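The relation between $\Lambda$ and $T$ in Example 8.3.1 can be checked numerically. The following R sketch uses simulated data (the seed, means, and standard deviation are arbitrary assumptions made only for illustration) to compute $u$, $w$, $w'$, $\Lambda$, and $T$ directly from their definitions, and compares $T$ with the pooled statistic returned by the base R function t.test.

```r
## Simulated data: illustrative assumptions, not from the text
set.seed(831)
n <- 10; m <- 6
x <- rnorm(n, mean = 50, sd = 10)
y <- rnorm(m, mean = 55, sd = 10)

## mles under omega (common mean) and Omega (separate means)
u      <- (sum(x) + sum(y)) / (n + m)
w      <- (sum((x - u)^2) + sum((y - u)^2)) / (n + m)
ssw    <- sum((x - mean(x))^2) + sum((y - mean(y))^2)
wprime <- ssw / (n + m)

## Likelihood ratio and the t-statistic of expression (8.3.4)
lambda <- (wprime / w)^((n + m) / 2)
tstat  <- sqrt(n * m / (n + m)) * (mean(x) - mean(y)) / sqrt(ssw / (n + m - 2))

## Critical value and decision at alpha = 0.05
crit   <- qt(0.975, n + m - 2)
reject <- abs(tstat) >= crit

## Both sides of Lambda^(2/(n+m)) = (n+m-2)/((n+m-2) + T^2)
c(lambda^(2 / (n + m)), (n + m - 2) / ((n + m - 2) + tstat^2))
## Pooled two-sample t-statistic from base R, for comparison
t.test(x, y, var.equal = TRUE)$statistic
```

The two entries of the printed vector should agree, and tstat should match the statistic from t.test with var.equal = TRUE.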


For this last example as well as the one-sample t-test derived in Example 6.5.1, it 
was found that the likelihood ratio test could be based on a statistic that, when the 
hypothesis Ho is true, has a t-distribution. To help us compute the power functions 
of these tests at parameter points other than those described by the hypothesis Ho, 
we turn to the following definition.

Definition 8.3.1. Let the random variable $W$ be $N(\delta, 1)$; let the random variable
$V$ be $\chi^2(r)$, and let $W$ and $V$ be independent. The quotient

$$T = \frac{W}{\sqrt{V/r}}$$

is said to have a noncentral $t$-distribution with $r$ degrees of freedom and noncentrality
parameter $\delta$. If $\delta = 0$, we say that $T$ has a central $t$-distribution.
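Definition 8.3.1 can be illustrated by simulation. The R sketch below builds the quotient $W/\sqrt{V/r}$ from independent normal and chi-square draws and compares an empirical probability with the value from pt using its ncp argument. The choices $r = 9$, $\delta = 1.5$, the evaluation point 2, and the number of draws are arbitrary assumptions for illustration.

```r
## Illustrative settings (assumptions, not from the text)
set.seed(1)
r <- 9; delta <- 1.5; nsim <- 100000

w <- rnorm(nsim, mean = delta, sd = 1)   # W ~ N(delta, 1)
v <- rchisq(nsim, df = r)                # V ~ chi-square(r), independent of W
tsim <- w / sqrt(v / r)                  # quotient of Definition 8.3.1

emp  <- mean(tsim <= 2)                  # empirical P(T <= 2)
theo <- pt(2, df = r, ncp = delta)       # noncentral t cdf in R
c(empirical = emp, theoretical = theo)
```

The two values should be close, differing only by simulation error.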


In the light of this definition, let us reexamine the $t$-statistics of Examples 6.5.1
and 8.3.1.


Example 8.3.2 (Power of the One Sample $t$-Test). For Example 6.5.1, consider a
more general situation. Assume that $X_1, \ldots, X_n$ is a random sample on $X$ that has
a $N(\mu, \sigma^2)$ distribution. We are interested in testing $H_0 : \mu = \mu_0$ versus $H_1 : \mu \ne \mu_0$,
where $\mu_0$ is specified. Then from Example 6.5.1, the likelihood ratio test statistic is

$$t(X_1, \ldots, X_n) = \frac{\sqrt{n}(\bar{X} - \mu_0)/\sigma}{\sqrt{\sum_{1}^{n}(X_i - \bar{X})^2/[\sigma^2(n-1)]}}.$$

The hypothesis $H_0$ is rejected at level $\alpha$ if $|t| \ge t_{\alpha/2,n-1}$. Suppose $\mu_1 \ne \mu_0$ is an
alternative of interest. Because $E_{\mu_1}[\sqrt{n}(\bar{X} - \mu_0)/\sigma] = \sqrt{n}(\mu_1 - \mu_0)/\sigma$, the power
of the test to detect $\mu_1$ is

$$\gamma(\mu_1) = P(|t| \ge t_{\alpha/2,n-1}) = 1 - P(t \le t_{\alpha/2,n-1}) + P(t \le -t_{\alpha/2,n-1}), \qquad (8.3.5)$$

where $t$ has a noncentral $t$-distribution with noncentrality parameter $\delta = \sqrt{n}(\mu_1 -
\mu_0)/\sigma$ and $n - 1$ degrees of freedom. This is computed in R by the call

    1 - pt(tc,n-1,ncp=delta) + pt(-tc,n-1,ncp=delta)

where tc is $t_{\alpha/2,n-1}$ and delta is the noncentrality parameter $\delta$.
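The call above can be wrapped in a small function that returns the power at a specified alternative. The function name onesample_t_power and its argument names are hypothetical, introduced only for this sketch; the numerical settings in the usage lines are likewise illustrative assumptions.

```r
## Hypothetical helper: power of the two-sided one-sample t-test, as in (8.3.5)
onesample_t_power <- function(mu1, mu0, sig, n, alpha = 0.05) {
  tc    <- qt(1 - alpha / 2, n - 1)          # critical value t_{alpha/2, n-1}
  delta <- sqrt(n) * (mu1 - mu0) / sig       # noncentrality parameter
  1 - pt(tc, n - 1, ncp = delta) + pt(-tc, n - 1, ncp = delta)
}

## Illustrative use: at mu1 = mu0 the power reduces to the level alpha
p_null <- onesample_t_power(50, 50, sig = 10, n = 25)
p_lo   <- onesample_t_power(47, 50, sig = 10, n = 25)
p_hi   <- onesample_t_power(53, 50, sig = 10, n = 25)
p_far  <- onesample_t_power(56, 50, sig = 10, n = 25)
c(p_null, p_lo, p_hi, p_far)
```

The power is symmetric about $\mu_0$ and grows as $|\mu_1 - \mu_0|$ increases, which is what the two-sided form of (8.3.5) implies.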

The following R code computes a graph of the power curve of this test. Notice
that the horizontal range of the plot is the interval $[\mu_0 - 4\sigma/\sqrt{n},\ \mu_0 + 4\sigma/\sqrt{n}]$. As
indicated, the parameters need to be set.

    ## Input mu0, sig, n, alpha.
    ## Half-width of the plotting range and the critical value t_{alpha/2, n-1}
    fse = 4*sig/sqrt(n); maxmu = mu0 + fse; tc = qt(1-(alpha/2),n-1)
    ## Grid of alternatives mu1 and the matching noncentrality parameters
    minmu = mu0 - fse; mu1 = seq(minmu,maxmu,.1)
    delta = (mu1-mu0)/(sig/sqrt(n))
    ## Power at each alternative, as in (8.3.5)
    gs = 1 - pt(tc,n-1,ncp=delta) + pt(-tc,n-1,ncp=delta)
    plot(gs~mu1,pch=" ",xlab=expression(mu[1]),ylab=expression(gamma))
    lines(gs~mu1)
This code is the body of the function tpowerg.R. Exercise 8.3.5 discusses its use. ■

Example 8.3.3 (Power of the Two Sample $t$-Test). In Example 8.3.1 we had

$$T = \frac{W_2}{\sqrt{V_2/(n+m-2)}},$$

where

$$W_2 = \sqrt{\frac{nm}{n+m}}\,(\bar{X} - \bar{Y})/\sigma$$

and

$$V_2 = \frac{\sum_{1}^{n}(X_i - \bar{X})^2 + \sum_{1}^{m}(Y_i - \bar{Y})^2}{\sigma^2}.$$

Here $W_2$ is $N[\sqrt{nm/(n+m)}(\theta_1 - \theta_2)/\sigma, 1]$, $V_2$ is $\chi^2(n+m-2)$, and $W_2$ and $V_2$ are
independent. Accordingly, if $\theta_1 \ne \theta_2$, $T$ has a noncentral $t$-distribution with $n+m-2$
degrees of freedom and noncentrality parameter $\delta_2 = \sqrt{nm/(n+m)}(\theta_1 - \theta_2)/\sigma$. It
is interesting to note that $\delta_1 = \sqrt{n}\theta_1/\sigma$ measures the deviation of $\theta_1$ from $\theta_1 = 0$
in units of the standard deviation $\sigma/\sqrt{n}$ of $\bar{X}$. The noncentrality parameter $\delta_2 =
\sqrt{nm/(n+m)}(\theta_1 - \theta_2)/\sigma$ is equal to the deviation of $\theta_1 - \theta_2$ from $\theta_1 - \theta_2 = 0$ in
units of the standard deviation $\sigma\sqrt{(n+m)/(mn)}$ of $\bar{X} - \bar{Y}$.

As in the last example, it is easy to write R code that evaluates power for this
test. For a numerical illustration, assume that the common variance is $\theta_3 = 100$,
$n = 20$, and $m = 15$. Suppose $\alpha = 0.05$ and we want to determine the power
of the test to detect $\Delta = 5$, where $\Delta = \theta_1 - \theta_2$. In this case the critical value is
$t_{0.025,33}$ = qt(.975, 33) = 2.0345 and the noncentrality parameter is $\delta_2 = 1.4639$. The
power is computed as

    1 - pt(2.0345,33,ncp=1.4639) + pt(-2.0345,33,ncp=1.4639) = 0.2954

Hence, the test has a 29.5% chance of detecting a difference in means of 5.
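The numerical illustration of Example 8.3.3 can be reproduced with a short function. The name twosample_t_power and its arguments are hypothetical, chosen only for this sketch; the inputs in the usage line are the values stated in the example.

```r
## Hypothetical helper: power of the two-sided pooled two-sample t-test
twosample_t_power <- function(Delta, sigma, n, m, alpha = 0.05) {
  df    <- n + m - 2
  tc    <- qt(1 - alpha / 2, df)                   # critical value
  delta <- sqrt(n * m / (n + m)) * Delta / sigma   # noncentrality parameter
  1 - pt(tc, df, ncp = delta) + pt(-tc, df, ncp = delta)
}

## Settings of Example 8.3.3: sigma^2 = 100, n = 20, m = 15, Delta = 5
delta2 <- sqrt(20 * 15 / 35) * 5 / 10
pw     <- twosample_t_power(Delta = 5, sigma = 10, n = 20, m = 15)
c(delta2 = delta2, power = pw)
```

Keeping $\Delta$, $\sigma$, $n$, and $m$ as arguments makes it easy to see how the power responds to sample size, which is the kind of calculation requested in Exercise 8.3.6.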


Remark 8.3.1. The one- and two-sample tests for normal means, presented in 
Examples 6.5.1 and 8.3.1, are the tests for normal means presented in most elemen- 
tary statistics books. They are based on the assumption of normality. What if the 
underlying distributions are not normal? In that case, with finite variances, the 
$t$-test statistics for these situations are asymptotically correct. For example,
consider the one-sample $t$-test. Suppose $X_1, \ldots, X_n$ are iid with a common nonnormal
pdf that has mean $\theta_1$ and finite variance $\sigma^2$. The hypotheses remain the same, i.e.,
$H_0 : \theta_1 = \theta_1'$ versus $H_1 : \theta_1 \ne \theta_1'$. The $t$-test statistic, $T_n$, is given by

$$T_n = \frac{\sqrt{n}(\bar{X} - \theta_1')}{S_n}, \qquad (8.3.6)$$

where $S_n$ is the sample standard deviation. Our critical region is $C_1 = \{|T_n| \ge
t_{\alpha/2,n-1}\}$. Recall that $S_n \to \sigma$ in probability. Hence, by the Central Limit Theorem,
under $H_0$,

$$T_n = \frac{\sigma}{S_n}\,\frac{\sqrt{n}(\bar{X} - \theta_1')}{\sigma} \xrightarrow{D} Z, \qquad (8.3.7)$$

where $Z$ has a standard normal distribution. Hence the asymptotic test would use
the critical region $C_2 = \{|T_n| \ge z_{\alpha/2}\}$. By (8.3.7) the critical region $C_2$ would have
approximate size $\alpha$. In practice, we would use $C_1$, because $t$ critical values are
generally larger than $z$ critical values and, hence, the use of $C_1$ would be conservative;
i.e., the size of $C_1$ would be slightly smaller than that of $C_2$. As Exercise 8.3.4
shows, the two-sample $t$-test is also asymptotically correct, provided the underlying
distributions have the same variance.
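The comparison of the $t$ and $z$ critical values made in Remark 8.3.1 is easy to tabulate. The R sketch below uses $\alpha = 0.05$ and a handful of sample sizes, both chosen arbitrarily for illustration.

```r
## Illustrative settings (assumptions, not from the text)
alpha <- 0.05
n     <- c(5, 10, 30, 100, 1000)

tcrit <- qt(1 - alpha / 2, n - 1)   # t_{alpha/2, n-1}, used by C_1
zcrit <- qnorm(1 - alpha / 2)       # z_{alpha/2}, used by C_2

cbind(n, tcrit, zcrit)
```

The $t$ critical values exceed the $z$ critical value for every $n$ and decrease toward it as $n$ grows, so the region $C_1$ is contained in $C_2$.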


For nonnormal situations where the distribution is “close” to the normal distri- 
bution, the t-test is essentially valid; i.e., the true level of significance is close to the 
nominal a. In terms of robustness, we would say that for these situations the t-test 
possesses robustness of validity. But the t-test may not possess robustness of 
power. For nonnormal situations, there are more powerful tests than the t-test; 
see Chapter 10 for discussion. 

For finite sample sizes and for distributions that are decidedly not normal, very 
skewed for instance, the validity of the t-test may also be questionable, as we illus- 
trate in the following simulation study. 


Example 8.3.4 (Skewed Contaminated Normal Family of Distributions). Consider
the random variable $X$ given by

$$X = (1 - I_\epsilon)Z + I_\epsilon Y, \qquad (8.3.8)$$

where $Z$ has a $N(0, 1)$ distribution, $Y$ has a $N(\mu_c, \sigma_c^2)$ distribution, $I_\epsilon$ has a bin$(1, \epsilon)$
distribution, and $Z$, $Y$, and $I_\epsilon$ are mutually independent. Assume that $\epsilon < 0.5$
and $\sigma_c > 1$, so that $Y$ is the contaminating random variable in the mixture. If
$\mu_c = 0$, then $X$ has the contaminated normal distribution discussed in Section
3.4.1, which is symmetrically distributed about 0. For $\mu_c \ne 0$, the distribution of $X$,
(8.3.8), is skewed and we call it the skewed contaminated normal distribution,
$SCN(\epsilon, \sigma_c, \mu_c)$. Note that $E(X) = \epsilon\mu_c$ and in Exercise 8.3.18 the cdf and pdf of $X$
are derived. The R function rscn generates random variates from this distribution.

In this example, we show the results of a small simulation study on the validity
of the $t$-test for random samples from the distribution of $X$. Consider the one-sided
hypotheses

$$H_0 : \mu = \mu_X \ \text{ versus } \ H_1 : \mu < \mu_X.$$

Let $X_1, X_2, \ldots, X_n$ be a random sample from the distribution of $X$. As a test
statistic we consider the $t$-test discussed in Example 4.5.4, which is also given in
expression (8.3.6); that is, the test statistic is $T_n = (\bar{X} - \mu_X)/(S_n/\sqrt{n})$, where $\bar{X}$
and $S_n$ are the sample mean and standard deviation of $X_1, X_2, \ldots, X_n$, respectively.
We set the level of significance at $\alpha = 0.05$ and used the decision rule: Reject $H_0$
if $T_n \le -t_{0.05,n-1}$. For the study, we set $n = 30$, $\epsilon = 0.20$, and $\sigma_c = 25$. For $\mu_c$, we
selected the five values of 0, 5, 10, 15, and 20, as shown in Table 8.3.1. For each of
these five situations, we ran 10,000 simulations and recorded $\hat{\alpha}$, which is the number
of rejections of $H_0$ divided by the number of simulations, i.e., the empirical $\alpha$ level.

For the test to be valid, $\hat{\alpha}$ should be close to the nominal value of 0.05. As
Table 8.3.1 shows, though, for all cases other than $\mu_c = 0$, the $t$-test is quite liberal;
that is, its empirical significance level far exceeds the nominal 0.05 level (as Exercise
8.3.19 shows, the sampling error in the table is about 0.004). Note that when $\mu_c = 0$
the distribution of $X$ is symmetric about 0 and in this case the empirical level is
close to the nominal value of 0.05.

Table 8.3.1: Empirical $\alpha$ Levels for the Nominal 0.05 $t$-Test of Example 8.3.4.
(Columns: $\mu_c$ = 0, 5, 10, 15, 20; entries: empirical $\hat{\alpha}$.)
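A simulation of the kind described in Example 8.3.4 can be sketched in a few lines of R. The generator rscn_sketch below is a hypothetical stand-in written directly from expression (8.3.8); it is not the rscn function of the text, whose exact arguments are not shown here. The seed and the reduced number of simulations are assumptions made to keep the run short, so the resulting values will differ from those in Table 8.3.1.

```r
## Hypothetical generator based on (8.3.8): X = (1 - I)Z + I*Y
rscn_sketch <- function(n, eps, sigmac, muc) {
  ind <- rbinom(n, 1, eps)                       # I ~ bin(1, eps)
  (1 - ind) * rnorm(n) + ind * rnorm(n, mean = muc, sd = sigmac)
}

## Empirical level of the lower-tailed t-test of H0: mu = mu_X
emp_alpha <- function(muc, n = 30, eps = 0.20, sigmac = 25,
                      nsim = 10000, alpha = 0.05) {
  mux <- eps * muc                               # E(X) = eps * mu_c
  tc  <- qt(alpha, n - 1)                        # equals -t_{alpha, n-1}
  rej <- replicate(nsim, {
    x <- rscn_sketch(n, eps, sigmac, muc)
    (mean(x) - mux) / (sd(x) / sqrt(n)) <= tc
  })
  mean(rej)
}

set.seed(834)
a0  <- emp_alpha(0,  nsim = 2000)   # symmetric case
a20 <- emp_alpha(20, nsim = 2000)   # strongly skewed case
c(a0 = a0, a20 = a20)
```

With the full 10,000 simulations per setting, calling emp_alpha for each of the five values of $\mu_c$ gives a table in the format of Table 8.3.1.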


8.3.2 Likelihood Ratio Tests for Testing Variances of Normal 
Distributions 


In this section, we discuss likelihood ratio tests for variances of normal distributions. 
In the next example, we begin with the two sample problem. 


Example 8.3.5. In Example 8.3.1, in testing the equality of the means of two
normal distributions, it was assumed that the unknown variances of the distributions
were equal. Let us now consider the problem of testing the equality of these two
unknown variances. We are given the independent random samples $X_1, \ldots, X_n$ and
$Y_1, \ldots, Y_m$ from the distributions, which are $N(\theta_1, \theta_3)$ and $N(\theta_2, \theta_4)$, respectively.
We have

$$\Omega = \{(\theta_1, \theta_2, \theta_3, \theta_4) : -\infty < \theta_1, \theta_2 < \infty,\ 0 < \theta_3, \theta_4 < \infty\}.$$

The hypothesis $H_0 : \theta_3 = \theta_4$, unspecified, with $\theta_1$ and $\theta_2$ also unspecified, is to be
tested against all alternatives. Then

$$\omega = \{(\theta_1, \theta_2, \theta_3, \theta_4) : -\infty < \theta_1, \theta_2 < \infty,\ 0 < \theta_3 = \theta_4 < \infty\}.$$

It is easy to show (see Exercise 8.3.11) that the statistic defined by $\Lambda = L(\hat{\omega})/L(\hat{\Omega})$
is a function of the statistic

$$F = \frac{\sum_{1}^{n}(X_i - \bar{X})^2/(n-1)}{\sum_{1}^{m}(Y_i - \bar{Y})^2/(m-1)}. \qquad (8.3.9)$$

If $\theta_3 = \theta_4$, this statistic $F$ has an $F$-distribution with $n - 1$ and $m - 1$ degrees of
freedom. The hypothesis that $(\theta_1, \theta_2, \theta_3, \theta_4) \in \omega$ is rejected if the computed $F \le c_1$
or if the computed $F \ge c_2$. The constants $c_1$ and $c_2$ are usually selected so that, if
$\theta_3 = \theta_4$,

$$P(F \le c_1) = P(F \ge c_2) = \frac{\alpha_1}{2},$$

where $\alpha_1$ is the desired significance level of this test. The power function of this
test is derived in Exercise 8.3.10.
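The equal-tailed constants $c_1$ and $c_2$ of Example 8.3.5 are quantiles of the $F$-distribution and are obtained in R with qf. The sketch below uses simulated normal data; the seed, sample sizes, and parameter values are arbitrary assumptions for illustration. The base R function var.test computes the same statistic.

```r
## Simulated data: illustrative assumptions, not from the text
set.seed(835)
n <- 12; m <- 9; alpha1 <- 0.05
x <- rnorm(n, mean = 0, sd = 2)
y <- rnorm(m, mean = 1, sd = 2)

## Statistic (8.3.9): ratio of the two sample variances
Fstat <- var(x) / var(y)

## Equal-tailed critical values: P(F <= c1) = P(F >= c2) = alpha1/2
c1 <- qf(alpha1 / 2,     n - 1, m - 1)
c2 <- qf(1 - alpha1 / 2, n - 1, m - 1)
reject <- (Fstat <= c1) || (Fstat >= c2)

c(F = Fstat, c1 = c1, c2 = c2)
var.test(x, y)$statistic   # same statistic from base R
```

As Remark 8.3.2 cautions, these critical values are trustworthy only when the normality assumption can be justified.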
Remark 8.3.2. We caution the reader on this last test for the equality of two
variances. In Remark 8.3.1, we discussed that the one- and two-sample t-tests for 
means are asymptotically correct. The two-sample variance test of the last example 
is not, however; see, for example, page 143 of Hettmansperger and McKean (2011). 
If the underlying distributions are not normal, then the F-critical values may be 
far from valid critical values (unlike the t-critical values for the means tests as 
discussed in Remark 8.3.1). In a large simulation study, Conover, Johnson, and 
Johnson (1981) showed that instead of having the nominal size of a = 0.05, the 
F-test for variances using the F-critical values could have significance levels as high 
as 0.80, in certain nonnormal situations. Thus the two-sample F-test for variances 
does not possess robustness of validity. It should only be used in situations where 
the assumption of normality can be justified. See Exercise 8.3.17 for an illustrative 
data set. 


The corresponding likelihood ratio test for the variance of a normal distribution
based on one sample is discussed in Exercise 8.3.9. The cautions raised in Remark
8.3.1 hold for this test also.


Example 8.3.6. Let the independent random variables $X$ and $Y$ have distributions
that are $N(\theta_1, \theta_3)$ and $N(\theta_2, \theta_4)$. In Example 8.3.1, we derived the likelihood ratio
test statistic $T$ of the hypothesis $\theta_1 = \theta_2$ when $\theta_3 = \theta_4$, while in Example 8.3.5
we obtained the likelihood ratio test statistic $F$ of the hypothesis $\theta_3 = \theta_4$. The
hypothesis that $\theta_1 = \theta_2$ is rejected if the computed $|T| \ge c$, where the constant $c$ is
selected so that $\alpha_2 = P(|T| \ge c;\ \theta_1 = \theta_2, \theta_3 = \theta_4)$ is the assigned significance level of
the test. We shall show that, if $\theta_3 = \theta_4$, the likelihood ratio test statistics for equality
of variances and equality of means, respectively $F$ and $T$, are independent. Among
other things, this means that if these two tests based on $F$ and $T$, respectively,
are performed sequentially with significance levels $\alpha_1$ and $\alpha_2$, the probability of
accepting both these hypotheses, when they are true, is $(1 - \alpha_1)(1 - \alpha_2)$. Thus the
significance level of this joint test is $\alpha = 1 - (1 - \alpha_1)(1 - \alpha_2)$.

Independence of $F$ and $T$, when $\theta_3 = \theta_4$, can be established using sufficiency
and completeness. The statistics $\bar{X}$, $\bar{Y}$, and $\sum_{1}^{n}(X_i - \bar{X})^2 + \sum_{1}^{m}(Y_i - \bar{Y})^2$ are
joint complete sufficient statistics for the three parameters $\theta_1$, $\theta_2$, and $\theta_3 = \theta_4$.
Obviously, the distribution of $F$ does not depend upon $\theta_1$, $\theta_2$, or $\theta_3 = \theta_4$, and hence
$F$ is independent of the three joint complete sufficient statistics. However, $T$ is a
function of these three joint complete sufficient statistics alone, and, accordingly, $T$
is independent of $F$. It is important to note that these two statistics are independent
whether $\theta_1 = \theta_2$ or $\theta_1 \ne \theta_2$. This permits us to calculate probabilities other than
the significance level of the test. For example, if $\theta_3 = \theta_4$ and $\theta_1 \ne \theta_2$, then

$$P(c_1 < F < c_2,\ |T| \ge c) = P(c_1 < F < c_2)\,P(|T| \ge c).$$

The second factor in the right-hand member is evaluated by using the probabilities
of a noncentral $t$-distribution. Of course, if $\theta_3 = \theta_4$ and the difference $\theta_1 - \theta_2$ is
large, we would want the preceding probability to be close to 1 because the event
$\{c_1 < F < c_2,\ |T| \ge c\}$ leads to a correct decision, namely, accept $\theta_3 = \theta_4$ and
reject $\theta_1 = \theta_2$. ■
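The two probabilities discussed in Example 8.3.6 can be evaluated numerically. The R sketch below assumes $\alpha_1 = \alpha_2 = 0.05$ and, for the second calculation, borrows the settings $n = 20$, $m = 15$, $\sigma = 10$, and $\theta_1 - \theta_2 = 5$ from Example 8.3.3; these pairings are illustrative assumptions rather than values given in Example 8.3.6.

```r
## Joint significance level of the sequential F and T tests
alpha1 <- 0.05; alpha2 <- 0.05
alpha_joint <- 1 - (1 - alpha1) * (1 - alpha2)

## Probability of the correct joint decision when theta3 = theta4 and
## theta1 - theta2 = Delta: P(c1 < F < c2) * P(|T| >= c)
n <- 20; m <- 15; sigma <- 10; Delta <- 5
df     <- n + m - 2
tc     <- qt(1 - alpha2 / 2, df)
delta2 <- sqrt(n * m / (n + m)) * Delta / sigma
power_T   <- 1 - pt(tc, df, ncp = delta2) + pt(-tc, df, ncp = delta2)
p_correct <- (1 - alpha1) * power_T      # P(c1 < F < c2) = 1 - alpha1

c(alpha_joint = alpha_joint, p_correct = p_correct)
```

The factorization used for p_correct relies on the independence of $F$ and $T$ established in the example.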
EXERCISES


8.3.1. Verzani (2014) discusses a data set on healthy individuals, including their
temperatures by gender. The data are in the file tempbygender.rda and the variables
of interest are maletemp and femaletemp. Download this file from the site
listed in the Preface.


(a) Obtain comparison boxplots. Comment on the plots. Which, if any, gen- 
der seems to have lower temperatures? Based on the width of the boxplots, 
comment on the assumption of equal variances. 


(b) As discussed in Example 8.3.3, compute the two-sample, two-sided t-test that 
there is no difference in the true mean temperatures between genders. Obtain 
the p-value of the test and conclude in terms of the problem at the nominal 
a-level of 0.05. 


(c) Obtain a 95% confidence interval for the difference in means. What does it 
mean in terms of the problem? 


8.3.2. Verify Equations (8.3.2) of Example 8.3.1 of this section. 
8.3.3. Verify Equations (8.3.3) of Example 8.3.1 of this section. 


8.3.4. Let $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ follow the location model

$$X_i = \theta_1 + Z_i, \quad i = 1, \ldots, n$$
$$Y_i = \theta_2 + Z_{n+i}, \quad i = 1, \ldots, m,$$

where $Z_1, \ldots, Z_{n+m}$ are iid random variables with common pdf $f(z)$. Assume that
$E(Z_i) = 0$ and $\text{Var}(Z_i) = \theta_3 < \infty$.

(a) Show that $E(X_i) = \theta_1$, $E(Y_i) = \theta_2$, and $\text{Var}(X_i) = \text{Var}(Y_i) = \theta_3$.

(b) Consider the hypotheses of Example 8.3.1, i.e.,

$$H_0 : \theta_1 = \theta_2 \ \text{ versus } \ H_1 : \theta_1 \ne \theta_2.$$

Show that under $H_0$, the test statistic $T$ given in expression (8.3.4) has a
limiting $N(0, 1)$ distribution.


(c) Using part (b), determine the corresponding large sample test (decision rule)
of $H_0$ versus $H_1$. (This shows that the test in Example 8.3.1 is asymptotically
correct.)


8.3.5. In Example 8.3.2, the power function for the one-sample t-test is discussed. 


(a) Plot the power function for the following setup: $X$ has a $N(\mu, \sigma^2)$ distribution;
$H_0 : \mu = 50$ versus $H_1 : \mu \ne 50$; $\alpha = 0.05$; $n = 25$; and $\sigma = 10$.


(b) Overlay the power curve in (a) with that for $\alpha = 0.01$. Comment.


(c) Overlay the power curve in (a) with that for $n = 35$. Comment.

(d) Determine the smallest value of $n$ so the power exceeds 0.80 to detect $\mu = 53$.
Hint: Modify the R function tpowerg.R so it returns the power for a specified
alternative.


8.3.6. The effect that a certain drug (Drug A) has on increasing blood pressure
is a major concern. It is thought that a modification of the drug (Drug B) will
lessen the increase in blood pressure. Let $\mu_A$ and $\mu_B$ be the true mean increases
in blood pressure due to Drug A and B, respectively. The hypotheses of interest
are $H_0 : \mu_A - \mu_B = 0$ versus $H_1 : \mu_A - \mu_B > 0$. The two-sample $t$-test statistic
discussed in Example 8.3.3 is to be used to conduct the analysis. The nominal level
is set at $\alpha = 0.05$. For the experimental design assume that the sample sizes are
the same; i.e., $m = n$. Also, based on data from Drug A, $\sigma = 30$ seems to be a
reasonable selection for the common standard deviation. Determine the common
sample size, so that the difference in means $\mu_A - \mu_B = 12$ has an 80% detection rate.
Suppose when the experiment is over, due to patients dropping out, the sample sizes
for Drugs A and B are respectively $n = 72$ and $m = 68$. What was the actual power
of the experiment to detect the difference of 12?


8.3.7. Show that the likelihood ratio principle leads to the same test when testing
a simple hypothesis $H_0$ against an alternative simple hypothesis $H_1$, as that given
by the Neyman-Pearson theorem. Note that there are only two points in $\Omega$.

8.3.8. Let $X_1, X_2, \ldots, X_n$ be a random sample from the normal distribution $N(\theta, 1)$.
Show that the likelihood ratio principle for testing $H_0 : \theta = \theta'$, where $\theta'$ is specified,
against $H_1 : \theta \ne \theta'$ leads to the inequality $|\bar{x} - \theta'| \ge c$.


(a) Is this a uniformly most powerful test of $H_0$ against $H_1$?
(b) Is this a uniformly most powerful unbiased test of $H_0$ against $H_1$?


8.3.9. Let $X_1, X_2, \ldots, X_n$ be iid $N(\theta_1, \theta_2)$. Show that the likelihood ratio principle
for testing $H_0 : \theta_2 = \theta_2'$ specified, and $\theta_1$ unspecified, against $H_1 : \theta_2 \ne \theta_2'$, $\theta_1$
unspecified, leads to a test that rejects when $\sum_{1}^{n}(x_i - \bar{x})^2 \le c_1$ or $\sum_{1}^{n}(x_i - \bar{x})^2 \ge c_2$,
where $c_1 < c_2$ are selected appropriately.


8.3.10. For the situation discussed in Example 8.3.5, derive the power function for 
the likelihood ratio test statistic given in expression (8.3.9). 


8.3.11. Let $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ be independent random samples from the
distributions $N(\theta_1, \theta_3)$ and $N(\theta_2, \theta_4)$, respectively.


(a) Show that the likelihood ratio for testing $H_0 : \theta_1 = \theta_2$, $\theta_3 = \theta_4$ against all
alternatives is given by

$$\frac{\left[\sum_{1}^{n}(x_i - \bar{x})^2/n\right]^{n/2}\left[\sum_{1}^{m}(y_i - \bar{y})^2/m\right]^{m/2}}{\left\{\left[\sum_{1}^{n}(x_i - u)^2 + \sum_{1}^{m}(y_i - u)^2\right]/(m+n)\right\}^{(n+m)/2}},$$

where $u = (n\bar{x} + m\bar{y})/(n + m)$.

(b) Show that the likelihood ratio for testing $H_0 : \theta_3 = \theta_4$ with $\theta_1$ and $\theta_2$ unspecified
can be based on the test statistic $F$ given in expression (8.3.9).


8.3.12. Let $Y_1 < Y_2 < \cdots < Y_5$ be the order statistics of a random sample of size
$n = 5$ from a distribution with pdf $f(x;\theta) = \frac{1}{2}e^{-|x-\theta|}$, $-\infty < x < \infty$, for all real $\theta$.
Find the likelihood ratio test $\Lambda$ for testing $H_0 : \theta = \theta_0$ against $H_1 : \theta \ne \theta_0$.


8.3.13. A random sample $X_1, X_2, \ldots, X_n$ arises from a distribution given by

$$H_0 : f(x; \theta) = \frac{1}{\theta}, \quad 0 < x < \theta, \text{ zero elsewhere,}$$

or

$$H_1 : f(x; \theta) = \frac{1}{\theta}e^{-x/\theta}, \quad 0 < x < \infty, \text{ zero elsewhere.}$$

Determine the likelihood ratio ($\Lambda$) test associated with the test of $H_0$ against $H_1$.


8.3.14. Consider a random sample $X_1, X_2, \ldots, X_n$ from a distribution with pdf
$f(x; \theta) = \theta(1 - x)^{\theta - 1}$, $0 < x < 1$, zero elsewhere, where $\theta > 0$.


(a) Find the form of the uniformly most powerful test of $H_0 : \theta = 1$ against
$H_1 : \theta > 1$.


(b) What is the likelihood ratio $\Lambda$ for testing $H_0 : \theta = 1$ against $H_1 : \theta \neq 1$?


8.3.15. Let $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_n$ be independent random samples from
two normal distributions $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$, respectively, where $\sigma^2$ is the
common but unknown variance.


(a) Find the likelihood ratio $\Lambda$ for testing $H_0 : \mu_1 = \mu_2 = 0$ against all alternatives.


(b) Rewrite $\Lambda$ so that it is a function of a statistic $Z$ which has a well-known
distribution.


(c) Give the distribution of Z under both null and alternative hypotheses. 


8.3.16. Let $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ be a random sample from a bivariate
normal distribution with $\mu_1, \mu_2, \sigma_1^2 = \sigma_2^2 = \sigma^2, \rho = \frac{1}{2}$, where $\mu_1$, $\mu_2$, and $\sigma^2 > 0$ are
unknown real numbers. Find the likelihood ratio $\Lambda$ for testing $H_0 : \mu_1 = \mu_2 = 0$, $\sigma^2$
unknown against all alternatives. The likelihood ratio $\Lambda$ is a function of what
statistic that has a well-known distribution?


8.3.17. Let $X$ be a random variable with pdf $f_X(x) = (2b_X)^{-1}\exp\{-|x|/b_X\}$, for
$-\infty < x < \infty$ and $b_X > 0$. First, show that the variance of $X$ is $\sigma_X^2 = 2b_X^2$. Next,
let $Y$, independent of $X$, have pdf $f_Y(y) = (2b_Y)^{-1}\exp\{-|y|/b_Y\}$, for $-\infty < y < \infty$
and $b_Y > 0$. Consider the hypotheses

$$H_0 : \sigma_X^2 = \sigma_Y^2 \text{ versus } H_1 : \sigma_X^2 > \sigma_Y^2.$$

To illustrate Remark 8.3.2 for testing these hypotheses, consider the following data
set (data are also in the file exercise8316.rda). Sample 1 represents the values
of a sample drawn on $X$ with $b_X = 1$, while Sample 2 represents the values of a
sample drawn on $Y$ with $b_Y = 1$. Hence, in this case $H_0$ is true.

Sample 1   -0.389  -2.177   0.813  -0.001  -0.110  -0.709   0.456   0.135
            0.763  -0.570  -2.565  -1.733   0.403   0.778  -0.115

Sample 2   -1.067  -0.577   0.361  -0.680  -0.634  -0.996  -0.181   0.239
           -0.775  -1.421  -0.818   0.328   0.213   1.425  -0.165


(a) Obtain comparison boxplots of these two samples. Comparison boxplots consist
of boxplots of both samples drawn on the same scale. Based on these
plots, in particular the interquartile ranges, what do you conclude about $H_0$?


(b) Obtain the $F$-test (for a one-sided hypothesis) as discussed in Remark 8.3.2
at level $\alpha = 0.10$. What is your conclusion?


(c) The test in part (b) is not exact. Why? 


8.3.18. For the skewed contaminated normal random variable X of Example 8.3.4, 
derive the cdf, pdf, mean, and variance of X. 


8.3.19. For Table 8.3.1 of Example 8.3.4, show that the half-width of the 95% 
confidence interval for a binomial proportion as given in Chapter 4 is 0.004 at the 
nominal value of 0.05. 


8.3.20. If computational facilities are available, perform a Monte Carlo study of 
the two-sided t-test for the skewed contaminated normal situation of Example 8.3.4. 
The R function rscn.R generates variates from the distribution of X. 


8.3.21. Suppose $X_1, \ldots, X_n$ is a random sample on $X$ which has a $N(\mu, \sigma_0^2)$ distribution,
where $\sigma_0^2$ is known. Consider the two-sided hypotheses

$$H_0 : \mu = 0 \text{ versus } H_1 : \mu \neq 0.$$

Show that the test based on the critical region $C = \{|\bar{X}| > \sqrt{\sigma_0^2/n}\, z_{\alpha/2}\}$ is an
unbiased level $\alpha$ test.


8.3.22. Assume the same situation as in the last exercise but consider the test
with critical region $C^* = \{\bar{X} > \sqrt{\sigma_0^2/n}\, z_{\alpha}\}$. Show that the test based on $C^*$ has
significance level $\alpha$ but that it is not an unbiased test.


8.4 *The Sequential Probability Ratio Test 


Theorem 8.1.1 provides us with a method for determining a best critical region
for testing a simple hypothesis against an alternative simple hypothesis. Recall its
statement: Let $X_1, X_2, \ldots, X_n$ be a random sample with fixed sample size $n$ from
a distribution that has pdf or pmf $f(x; \theta)$, where $\Omega = \{\theta : \theta = \theta', \theta''\}$ and $\theta'$ and $\theta''$
are known numbers. For this section, we denote the likelihood of $X_1, X_2, \ldots, X_n$
by

$$L(\theta; n) = f(x_1; \theta) f(x_2; \theta) \cdots f(x_n; \theta),$$

a notation that reveals both the parameter $\theta$ and the sample size $n$. If we reject
$H_0 : \theta = \theta'$ and accept $H_1 : \theta = \theta''$ when and only when

$$\frac{L(\theta'; n)}{L(\theta''; n)} \leq k,$$

where $k > 0$, then by Theorem 8.1.1 this is a best test of $H_0$ against $H_1$.

Let us now suppose that the sample size $n$ is not fixed in advance. In fact,
let the sample size be a random variable $N$ with sample space $\{1, 2, 3, \ldots\}$. An
interesting procedure for testing the simple hypothesis $H_0 : \theta = \theta'$ against the simple
hypothesis $H_1 : \theta = \theta''$ is the following: Let $k_0$ and $k_1$ be two positive constants
with $k_0 < k_1$. Observe the independent outcomes $X_1, X_2, X_3, \ldots$ in a sequence, for
example, $x_1, x_2, x_3, \ldots$, and compute

$$\frac{L(\theta'; 1)}{L(\theta''; 1)}, \quad \frac{L(\theta'; 2)}{L(\theta''; 2)}, \quad \frac{L(\theta'; 3)}{L(\theta''; 3)}, \quad \ldots.$$


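The hypothesis $H_0 : \theta = \theta'$ is rejected (and $H_1 : \theta = \theta''$ is accepted) if and only if
there exists a positive integer $n$ so that $\mathbf{x}_n = (x_1, x_2, \ldots, x_n)$ belongs to the set

$$C_n = \left\{\mathbf{x}_n : k_0 < \frac{L(\theta', j)}{L(\theta'', j)} < k_1,\ j = 1, \ldots, n - 1, \text{ and } \frac{L(\theta', n)}{L(\theta'', n)} \leq k_0\right\}. \tag{8.4.1}$$

On the other hand, the hypothesis $H_0 : \theta = \theta'$ is accepted (and $H_1 : \theta = \theta''$
is rejected) if and only if there exists a positive integer $n$ so that $(x_1, x_2, \ldots, x_n)$
belongs to the set

$$B_n = \left\{\mathbf{x}_n : k_0 < \frac{L(\theta', j)}{L(\theta'', j)} < k_1,\ j = 1, \ldots, n - 1, \text{ and } \frac{L(\theta', n)}{L(\theta'', n)} \geq k_1\right\}. \tag{8.4.2}$$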

That is, we continue to observe sample observations as long as

$$k_0 < \frac{L(\theta', n)}{L(\theta'', n)} < k_1. \tag{8.4.3}$$

We stop these observations in one of two ways:

1. With rejection of $H_0 : \theta = \theta'$ as soon as

$$\frac{L(\theta', n)}{L(\theta'', n)} \leq k_0,$$

or

2. With acceptance of $H_0 : \theta = \theta'$ as soon as

$$\frac{L(\theta', n)}{L(\theta'', n)} \geq k_1.$$

A test of this kind is called Wald's sequential probability ratio test. Frequently,
inequality (8.4.3) can be conveniently expressed in an equivalent form:

$$c_0(n) < u(x_1, x_2, \ldots, x_n) < c_1(n), \tag{8.4.4}$$

where $u(X_1, X_2, \ldots, X_n)$ is a statistic and $c_0(n)$ and $c_1(n)$ depend on the constants
$k_0$, $k_1$, $\theta'$, $\theta''$, and on $n$. Then the observations are stopped and a decision is reached
as soon as

$$u(x_1, x_2, \ldots, x_n) \leq c_0(n) \quad \text{or} \quad u(x_1, x_2, \ldots, x_n) \geq c_1(n).$$

We now give an illustrative example.
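Example 8.4.1. Let $X$ have a pmf

$$f(x; \theta) = \begin{cases} \theta^x(1 - \theta)^{1 - x} & x = 0, 1 \\ 0 & \text{elsewhere.} \end{cases}$$

In the preceding discussion of a sequential probability ratio test, let $H_0 : \theta = \frac{1}{3}$ and
$H_1 : \theta = \frac{2}{3}$; then, with $\sum x_i = \sum_{i=1}^{n} x_i$,

$$\frac{L(\frac{1}{3}, n)}{L(\frac{2}{3}, n)} = \frac{(\frac{1}{3})^{\sum x_i}(\frac{2}{3})^{n - \sum x_i}}{(\frac{2}{3})^{\sum x_i}(\frac{1}{3})^{n - \sum x_i}} = 2^{n - 2\sum x_i}.$$

If we take logarithms to the base 2, the inequality

$$k_0 < \frac{L(\frac{1}{3}, n)}{L(\frac{2}{3}, n)} < k_1,$$

with $0 < k_0 < k_1$, becomes

$$\log_2 k_0 < n - 2\sum_{1}^{n} x_i < \log_2 k_1,$$

or, equivalently, in the notation of expression (8.4.4),

$$c_0(n) = \frac{n}{2} - \frac{1}{2}\log_2 k_1 < \sum_{1}^{n} x_i < \frac{n}{2} - \frac{1}{2}\log_2 k_0 = c_1(n).$$

Note that $L(\frac{1}{3}, n)/L(\frac{2}{3}, n) \leq k_0$ if and only if $c_1(n) \leq \sum_{1}^{n} x_i$; and $L(\frac{1}{3}, n)/L(\frac{2}{3}, n) \geq k_1$
if and only if $c_0(n) \geq \sum_{1}^{n} x_i$. Thus we continue to observe outcomes as long as
$c_0(n) < \sum_{1}^{n} x_i < c_1(n)$. The observation of outcomes is discontinued with the first
value $n$ of $N$ for which either $c_1(n) \leq \sum_{1}^{n} x_i$ or $c_0(n) \geq \sum_{1}^{n} x_i$. The inequality
$c_1(n) \leq \sum_{1}^{n} x_i$ leads to rejection of $H_0 : \theta = \frac{1}{3}$ (the acceptance of $H_1$), and the
inequality $c_0(n) \geq \sum_{1}^{n} x_i$ leads to the acceptance of $H_0 : \theta = \frac{1}{3}$ (the rejection of
$H_1$). $\blacksquare$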
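
The stopping rule of Example 8.4.1 can be sketched in a few lines of R. This is only an illustration: the helper name `sprt_bernoulli`, the data vector, and the choice $k_0 = 1/9$, $k_1 = 9$ are assumptions made here and are not part of the text or of the package hmcpkg.

```r
# Sketch of the stopping rule in Example 8.4.1 (H0: theta = 1/3 vs H1: theta = 2/3).
# sprt_bernoulli is a hypothetical helper written for this illustration.
sprt_bernoulli <- function(x, k0, k1) {
  n  <- seq_along(x)
  s  <- cumsum(x)                        # running sum of the observations
  c0 <- n / 2 - 0.5 * log2(k1)           # c0(n): acceptance boundary
  c1 <- n / 2 - 0.5 * log2(k0)           # c1(n): rejection boundary
  hit <- which(s <= c0 | s >= c1)        # steps at which a boundary is reached
  if (length(hit) == 0) {
    return(list(decision = "continue", n = length(x)))
  }
  m <- hit[1]                            # first such step is the stopping time
  decision <- if (s[m] >= c1[m]) "reject H0" else "accept H0"
  list(decision = decision, n = m)
}

# Assumed constants k0 = 1/9, k1 = 9 and an assumed sequence of outcomes
res <- sprt_bernoulli(c(1, 1, 0, 1, 1, 1), k0 = 1/9, k1 = 9)
res$decision   # "reject H0"
res$n          # 6
```

Here the running sum first reaches $c_1(n) = n/2 + \frac{1}{2}\log_2 9$ at $n = 6$, so sampling stops there with rejection of $H_0$.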


Remark 8.4.1. At this point, the reader undoubtedly sees that there are many
questions that should be raised in connection with the sequential probability ratio
test. Some of these questions are possibly among the following:

1. What is the probability of the procedure continuing indefinitely?

2. What is the value of the power function of this test at each of the points $\theta = \theta'$
and $\theta = \theta''$?

3. If $\theta''$ is one of several values of $\theta$ specified by an alternative composite hypothesis,
say $H_1 : \theta > \theta'$, what is the power function at each point $\theta > \theta'$?

4. Since the sample size $N$ is a random variable, what are some of the properties
of the distribution of $N$? In particular, what is the expected value $E(N)$ of
$N$?

5. How does this test compare with tests that have a fixed sample size $n$? $\blacksquare$


A course in sequential analysis would investigate these and many other problems.
However, in this book our objective is largely that of acquainting the reader with
this kind of test procedure. Accordingly, we assert that the answer to question 1
is zero. Moreover, it can be proved that if $\theta = \theta'$ or if $\theta = \theta''$, $E(N)$ is smaller for
this sequential procedure than the sample size of a fixed-sample-size test that has
the same values of the power function at those points. We now consider question 2
in some detail.

In this section we shall denote the power of the test when $H_0$ is true by the
symbol $\alpha$ and the power of the test when $H_1$ is true by the symbol $1 - \beta$. Thus
$\alpha$ is the probability of committing a Type I error (the rejection of $H_0$ when $H_0$ is
true), and $\beta$ is the probability of committing a Type II error (the acceptance of $H_0$
when $H_0$ is false). With the sets $C_n$ and $B_n$ as previously defined, and with random
variables of the continuous type, we then have


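$$\alpha = \sum_{n=1}^{\infty}\int_{C_n} L(\theta', n), \qquad 1 - \beta = \sum_{n=1}^{\infty}\int_{C_n} L(\theta'', n).$$

Since the probability is 1 that the procedure terminates, we also have

$$1 - \alpha = \sum_{n=1}^{\infty}\int_{B_n} L(\theta', n), \qquad \beta = \sum_{n=1}^{\infty}\int_{B_n} L(\theta'', n).$$

If $(x_1, x_2, \ldots, x_n) \in C_n$, we have $L(\theta', n) \leq k_0 L(\theta'', n)$; hence, it is clear that

$$\alpha = \sum_{n=1}^{\infty}\int_{C_n} L(\theta', n) \leq \sum_{n=1}^{\infty}\int_{C_n} k_0 L(\theta'', n) = k_0(1 - \beta).$$

Because $L(\theta', n) \geq k_1 L(\theta'', n)$ at each point of the set $B_n$, we have

$$1 - \alpha = \sum_{n=1}^{\infty}\int_{B_n} L(\theta', n) \geq \sum_{n=1}^{\infty}\int_{B_n} k_1 L(\theta'', n) = k_1\beta.$$

Accordingly, it follows that

$$\frac{\alpha}{1 - \beta} \leq k_0, \qquad k_1 \leq \frac{1 - \alpha}{\beta}, \tag{8.4.5}$$

provided that $\beta$ is not equal to 0 or 1.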
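Now let $\alpha_a$ and $\beta_a$ be preassigned proper fractions; some typical values in the
applications are 0.01, 0.05, and 0.10. If we take

$$k_0 = \frac{\alpha_a}{1 - \beta_a}, \qquad k_1 = \frac{1 - \alpha_a}{\beta_a},$$

then inequalities (8.4.5) become

$$\frac{\alpha}{1 - \beta} \leq \frac{\alpha_a}{1 - \beta_a}, \qquad \frac{1 - \alpha_a}{\beta_a} \leq \frac{1 - \alpha}{\beta}; \tag{8.4.6}$$

or, equivalently,

$$\alpha(1 - \beta_a) \leq (1 - \beta)\alpha_a, \qquad \beta(1 - \alpha_a) \leq (1 - \alpha)\beta_a.$$

If we add corresponding members of the immediately preceding inequalities, we find
that

$$\alpha + \beta - \alpha\beta_a - \beta\alpha_a \leq \alpha_a + \beta_a - \beta\alpha_a - \alpha\beta_a$$

and hence

$$\alpha + \beta \leq \alpha_a + \beta_a;$$

that is, the sum $\alpha + \beta$ of the probabilities of the two kinds of errors is bounded
above by the sum $\alpha_a + \beta_a$ of the preassigned numbers. Moreover, since $\alpha$ and $\beta$ are
positive proper fractions, inequalities (8.4.6) imply that

$$\alpha \leq \frac{\alpha}{1 - \beta} \leq \frac{\alpha_a}{1 - \beta_a}, \qquad \beta \leq \frac{\beta}{1 - \alpha} \leq \frac{\beta_a}{1 - \alpha_a};$$

consequently, we have an upper bound on each of $\alpha$ and $\beta$. Various investigations
of the sequential probability ratio test seem to indicate that in most practical cases,
the values of $\alpha$ and $\beta$ are quite close to $\alpha_a$ and $\beta_a$. This prompts us to approximate
the power function at the points $\theta = \theta'$ and $\theta = \theta''$ by $\alpha_a$ and $1 - \beta_a$, respectively.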
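
As a quick numerical companion to inequalities (8.4.6), the following R lines compute the stopping constants and the implied upper bounds on the actual error probabilities. The preassigned values 0.05 and 0.10 and all variable names are assumptions chosen only for illustration.

```r
# Preassigned error rates (illustrative values chosen for this sketch)
alpha_a <- 0.05
beta_a  <- 0.10

k0 <- alpha_a / (1 - beta_a)        # lower stopping constant
k1 <- (1 - alpha_a) / beta_a        # upper stopping constant

# Upper bounds on the actual alpha and beta implied by (8.4.6)
alpha_bound <- alpha_a / (1 - beta_a)
beta_bound  <- beta_a / (1 - alpha_a)

round(c(k0 = k0, k1 = k1, alpha_bound = alpha_bound, beta_bound = beta_bound), 4)
```

With these inputs $k_1 = 9.5$, and both bounds lie only slightly above the preassigned values, which is consistent with the remark that $\alpha$ and $\beta$ are usually close to $\alpha_a$ and $\beta_a$.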


Example 8.4.2. Let $X$ be $N(\theta, 100)$. To find the sequential probability ratio
test for testing $H_0 : \theta = 75$ against $H_1 : \theta = 78$ such that each of $\alpha$ and $\beta$ is
approximately equal to 0.10, take

$$k_0 = \frac{0.10}{1 - 0.10} = \frac{1}{9}, \qquad k_1 = \frac{1 - 0.10}{0.10} = 9.$$

Since

$$\frac{L(75, n)}{L(78, n)} = \frac{\exp\left[-\sum (x_i - 75)^2/2(100)\right]}{\exp\left[-\sum (x_i - 78)^2/2(100)\right]} = \exp\left(-\frac{6\sum x_i - 459n}{200}\right),$$

the inequality

$$k_0 = \frac{1}{9} < \frac{L(75, n)}{L(78, n)} < 9 = k_1$$

can be rewritten, by taking logarithms, as

$$-\log 9 < \frac{6\sum x_i - 459n}{200} < \log 9.$$

This inequality is equivalent to the inequality

$$c_0(n) = \frac{153}{2}n - \frac{100}{3}\log 9 < \sum_{1}^{n} x_i < \frac{153}{2}n + \frac{100}{3}\log 9 = c_1(n).$$

Moreover, $L(75, n)/L(78, n) \leq k_0$ and $L(75, n)/L(78, n) \geq k_1$ are equivalent to the
inequalities $\sum_{1}^{n} x_i \geq c_1(n)$ and $\sum_{1}^{n} x_i \leq c_0(n)$, respectively. Thus the observation
of outcomes is discontinued with the first value $n$ of $N$ for which either $\sum_{1}^{n} x_i \geq c_1(n)$
or $\sum_{1}^{n} x_i \leq c_0(n)$. The inequality $\sum_{1}^{n} x_i \geq c_1(n)$ leads to the rejection
of $H_0 : \theta = 75$, and the inequality $\sum_{1}^{n} x_i \leq c_0(n)$ leads to the acceptance of
$H_0 : \theta = 75$. The power of the test is approximately 0.10 when $H_0$ is true, and
approximately 0.90 when $H_1$ is true. $\blacksquare$
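
A small Monte Carlo study gives a feel for how the test of Example 8.4.2 behaves. The sketch below is not from the text: the helper `sprt_normal`, the cap `nmax` on the number of observations, the seed, and the number of replications are all assumptions made for illustration.

```r
# Monte Carlo sketch for Example 8.4.2: X ~ N(theta, 100), H0: theta = 75 vs H1: theta = 78.
# sprt_normal is a hypothetical helper; nmax is an assumed cap on the sample size.
sprt_normal <- function(theta, nmax = 1000) {
  x  <- rnorm(nmax, mean = theta, sd = 10)
  n  <- seq_len(nmax)
  s  <- cumsum(x)                              # running sum of the x_i
  c0 <- (153 / 2) * n - (100 / 3) * log(9)     # acceptance boundary c0(n)
  c1 <- (153 / 2) * n + (100 / 3) * log(9)     # rejection boundary c1(n)
  m  <- which(s <= c0 | s >= c1)[1]            # first n at which a boundary is reached
  c(n = m, reject = as.numeric(s[m] >= c1[m]))
}

set.seed(84)
sim0 <- replicate(2000, sprt_normal(75))       # H0 true
sim1 <- replicate(2000, sprt_normal(78))       # H1 true

alpha_hat <- mean(sim0["reject", ])            # estimated Type I error rate
beta_hat  <- 1 - mean(sim1["reject", ])        # estimated Type II error rate
c(alpha_hat = alpha_hat, beta_hat = beta_hat,
  mean_N_H0 = mean(sim0["n", ]), mean_N_H1 = mean(sim1["n", ]))
```

The estimated error rates can be compared with the preassigned value 0.10, and the average stopping times give a rough empirical impression of $E(N)$ under each hypothesis.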


Remark 8.4.2. It is interesting to note that a sequential probability ratio test can
be thought of as a random-walk procedure. To illustrate, the final inequalities of
Examples 8.4.1 and 8.4.2 can be written as

$$-\log_2 k_1 < \sum_{1}^{n} 2(x_i - 0.5) < -\log_2 k_0$$

and

$$-\frac{100}{3}\log 9 < \sum_{1}^{n} (x_i - 76.5) < \frac{100}{3}\log 9,$$

respectively. In each instance, think of starting at the point zero and taking random
steps until one of the boundaries is reached. In the first situation the random steps
are $2(X_1 - 0.5), 2(X_2 - 0.5), 2(X_3 - 0.5), \ldots$, which have the same length, 1, but
with random directions. In the second instance, both the length and the direction
of the steps are random variables, $X_1 - 76.5, X_2 - 76.5, X_3 - 76.5, \ldots$.


In recent years, there has been much attention devoted to improving quality
of products using statistical methods. One such simple method was developed by
Walter Shewhart in which a sample of size $n$ of the items being produced is taken and
they are measured, resulting in $n$ values. The mean $\bar{X}$ of these $n$ measurements has
an approximate normal distribution with mean $\mu$ and variance $\sigma^2/n$. In practice,
$\mu$ and $\sigma^2$ must be estimated, but in this discussion, we assume that they are known.
From theory we know that the probability is 0.997 that $\bar{X}$ is between

$$\text{LCL} = \mu - \frac{3\sigma}{\sqrt{n}} \quad \text{and} \quad \text{UCL} = \mu + \frac{3\sigma}{\sqrt{n}}.$$

These two values are called the lower (LCL) and upper (UCL) control limits, respectively.
Samples like these are taken periodically, resulting in a sequence of means,
say $\bar{x}_1, \bar{x}_2, \bar{x}_3, \ldots$. These are usually plotted; and if they are between the LCL and
UCL, we say that the process is in control. If one falls outside the limits, this
would suggest that the mean $\mu$ has shifted, and the process would be investigated.

It was recognized by some that there could be a shift in the mean, say from $\mu$ to
$\mu + (\sigma/\sqrt{n})$; and it would still be difficult to detect that shift with a single sample
mean, for now the probability of a single $\bar{x}$ exceeding UCL is only about 0.023. This
means that we would need about $1/0.023 \approx 43$ samples, each of size $n$, on the average
before detecting such a shift. This seems too long; so statisticians recognized that
they should be cumulating experience as the sequence $\bar{X}_1, \bar{X}_2, \bar{X}_3, \ldots$ is observed
in order to help them detect the shift sooner. It is the practice to compute the
standardized variable $Z = (\bar{X} - \mu)/(\sigma/\sqrt{n})$; thus, we state the problem in these
terms and provide the solution given by a sequential probability ratio test.
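
The figures 0.023 and 43 quoted above can be checked directly in R. The variable names below are assumptions for this sketch; the calculation treats the number of samples until the first signal as geometric, so its mean is the reciprocal of the per-sample exceedance probability.

```r
# After a shift of sigma/sqrt(n), the standardized mean is N(1, 1), and
# exceeding UCL = mu + 3 sigma/sqrt(n) corresponds to a N(0, 1) variable exceeding 2.
p_exceed <- 1 - pnorm(2)        # about 0.0228
arl      <- 1 / p_exceed        # mean of the geometric waiting time, about 44

c(p_exceed = p_exceed, arl = arl)
```

Rounding the probability to 0.023 before inverting gives the text's value of about 43 samples.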

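Here $Z$ is $N(\theta, 1)$, and we wish to test $H_0 : \theta = 0$ against $H_1 : \theta = 1$ using the
sequence of iid random variables $Z_1, Z_2, \ldots, Z_m, \ldots$. We use $m$ rather than $n$, as
the latter is the size of the samples taken periodically. We have

$$\frac{L(0, m)}{L(1, m)} = \frac{\exp\left[-\sum z_i^2/2\right]}{\exp\left[-\sum (z_i - 1)^2/2\right]} = \exp\left[-\sum_{i=1}^{m}(z_i - 0.5)\right].$$

Thus

$$k_0 < \exp\left[-\sum_{i=1}^{m}(z_i - 0.5)\right] < k_1$$

can be written as

$$h = -\log k_0 > \sum_{i=1}^{m}(z_i - 0.5) > -\log k_1 = -h.$$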


It is true that $-\log k_0 = \log k_1$ when $\alpha_a = \beta_a$. Often, $h = -\log k_0$ is taken
to be about 4 or 5, suggesting that $\alpha_a = \beta_a$ is small, like 0.01. As $\sum (z_i - 0.5)$
is cumulating the sum of $z_i - 0.5$, $i = 1, 2, 3, \ldots$, these procedures are often called
CUSUMS. If the CUSUM $= \sum (z_i - 0.5)$ exceeds $h$, we would investigate the process,
as it seems that the mean has shifted upward. If this shift is to $\theta = 1$, the theory
associated with these procedures shows that we need only eight or nine samples on
the average, rather than 43, to detect this shift. For more information about these
methods, the reader is referred to one of the many books on quality improvement
through statistical methods. What we would like to emphasize here is that through
sequential methods (not only the sequential probability ratio test), we should take
advantage of all past experience that we can gather in making inferences.
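
The claim about the number of samples needed after a shift to $\theta = 1$ can be explored by simulation. The following is only a rough sketch under stated assumptions: the helper `run_length`, the choice $h = 4$, the cap `mmax`, the seed, and the number of replications are all hypothetical, and only the upper boundary $h$ is monitored (the lower boundary $-h$ is ignored).

```r
# Simulated run length of the CUSUM sum(z_i - 0.5) after a shift to theta = 1.
# run_length is a hypothetical helper; only the upper boundary h is monitored.
run_length <- function(h, theta = 1, mmax = 500) {
  z     <- rnorm(mmax, mean = theta)   # standardized sample means after the shift
  cusum <- cumsum(z - 0.5)             # cumulative sum of z_i - 0.5
  which(cusum > h)[1]                  # first m at which the CUSUM exceeds h
}

set.seed(31044)
rl <- replicate(5000, run_length(h = 4))
mean(rl)                               # average number of samples until the signal
```

Since each step has mean 0.5 after the shift, the average run length should be a little above $h/0.5 = 8$, which is far below the 43 samples needed when each sample mean is judged on its own.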


EXERCISES 


8.4.1. Let $X$ be $N(0, \theta)$ and, in the notation of this section, let $\theta' = 4$, $\theta'' = [\ldots]$,
$\alpha_a = 0.05$, and $\beta_a = 0.10$. Show that the sequential probability ratio test can be
based upon the statistic $\sum_{1}^{n} X_i^2$. Determine $c_0(n)$ and $c_1(n)$.

8.4.2. Let $X$ have a Poisson distribution with mean $\theta$. Find the sequential probability
ratio test for testing $H_0 : \theta = 0.02$ against $H_1 : \theta = 0.07$. Show that this test
can be based upon the statistic $\sum_{1}^{n} X_i$. If $\alpha_a = 0.20$ and $\beta_a = 0.10$, find $c_0(n)$ and
$c_1(n)$.


8.4.3. Let the independent random variables $Y$ and $Z$ be $N(\mu_1, 1)$ and $N(\mu_2, 1)$,
respectively. Let $\theta = \mu_1 - \mu_2$. Let us observe independent observations from
each distribution, say $Y_1, Y_2, \ldots$ and $Z_1, Z_2, \ldots$. To test sequentially the hypothesis
$H_0 : \theta = 0$ against $H_1 : \theta = \frac{1}{2}$, use the sequence $X_i = Y_i - Z_i$, $i = 1, 2, \ldots$. If
$\alpha_a = \beta_a = 0.05$, show that the test can be based upon $\bar{X} = \bar{Y} - \bar{Z}$. Find $c_0(n)$ and
$c_1(n)$.


8.4.4. Suppose that a manufacturing process makes about 3% defective items, 
which is considered satisfactory for this particular product. The managers would 
like to decrease this to about 1% and clearly want to guard against a substantial 
increase, say to 5%. To monitor the process, periodically n = 100 items are taken 
and the number $X$ of defectives counted. Assume that $X$ is $b(n = 100, p = \theta)$.
Based on a sequence $X_1, X_2, \ldots, X_m, \ldots$, determine a sequential probability ratio
test that tests $H_0 : \theta = 0.01$ against $H_1 : \theta = 0.05$. (Note that $\theta = 0.03$, the present
level, is in between these two values.) Write this test in the form

$$h_0 > \sum_{i=1}^{m}(x_i - nd) > h_1$$

and determine $d$, $h_0$, and $h_1$, if $\alpha_a = \beta_a = 0.02$.


8.4.5. Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution with pdf
$f(x; \theta) = \theta x^{\theta - 1}$, $0 < x < 1$, zero elsewhere.


(a) Find a complete sufficient statistic for $\theta$.


(b) If $\alpha_a = \beta_a = [\ldots]$, find the sequential probability ratio test of $H_0 : \theta = 2$ against
$H_1 : \theta = 3$.


8.5 *Minimax and Classification Procedures 


We have considered several procedures that may be used in problems of point es- 
timation. Among these were decision function procedures (in particular, minimax 
decisions). In this section, we apply minimax procedures to the problem of testing a 
simple hypothesis $H_0$ against an alternative simple hypothesis $H_1$. It is important
to observe that these procedures yield, in accordance with the Neyman-Pearson
theorem, a best test of $H_0$ against $H_1$. We end this section with a discussion on an
application of these procedures to a classification problem. 


8.5.1 Minimax Procedures 


We first investigate the decision function approach to the problem of testing a simple
null hypothesis against a simple alternative hypothesis. Let the joint pdf of the $n$
random variables $X_1, X_2, \ldots, X_n$ depend upon the parameter $\theta$. Here $n$ is a fixed
positive integer. This pdf is denoted by $L(\theta; x_1, x_2, \ldots, x_n)$ or, for brevity, by $L(\theta)$.
Let $\theta'$ and $\theta''$ be distinct and fixed values of $\theta$. We wish to test the simple hypothesis
$H_0 : \theta = \theta'$ against the simple hypothesis $H_1 : \theta = \theta''$. Thus the parameter space is
$\Omega = \{\theta : \theta = \theta', \theta''\}$. In accordance with the decision function procedure, we need
a function $\delta$ of the observed values of $X_1, \ldots, X_n$ (or, of the observed value of a
statistic $Y$) that decides which of the two values of $\theta$, $\theta'$ or $\theta''$, to accept. That is,
the function $\delta$ selects either $H_0 : \theta = \theta'$ or $H_1 : \theta = \theta''$. We denote these decisions
by $\delta = \theta'$ and $\delta = \theta''$, respectively. Let $\mathcal{L}(\theta, \delta)$ represent the loss function associated
with this decision problem. Because the pairs $(\theta = \theta', \delta = \theta')$ and $(\theta = \theta'', \delta = \theta'')$
represent correct decisions, we shall always take $\mathcal{L}(\theta', \theta') = \mathcal{L}(\theta'', \theta'') = 0$. On the
other hand, if either $\delta = \theta''$ when $\theta = \theta'$ or $\delta = \theta'$ when $\theta = \theta''$, then a positive value
should be assigned to the loss function; that is, $\mathcal{L}(\theta', \theta'') > 0$ and $\mathcal{L}(\theta'', \theta') > 0$.

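It has previously been emphasized that a test of $H_0 : \theta = \theta'$ against $H_1 : \theta = \theta''$
can be described in terms of a critical region in the sample space. We can do the
same kind of thing with the decision function. That is, we can choose a subset $C$
of the sample space and if $(x_1, x_2, \ldots, x_n) \in C$, we can make the decision $\delta = \theta''$;
whereas if $(x_1, x_2, \ldots, x_n) \in C^c$, the complement of $C$, we make the decision $\delta = \theta'$.
Thus a given critical region $C$ determines the decision function. In this sense, we
may denote the risk function by $R(\theta, C)$ instead of $R(\theta, \delta)$. That is, in a notation
used in Section 7.1,

$$R(\theta, C) = R(\theta, \delta) = \int_{C \cup C^c} \mathcal{L}(\theta, \delta)L(\theta).$$

Since $\delta = \theta''$ if $(x_1, \ldots, x_n) \in C$ and $\delta = \theta'$ if $(x_1, \ldots, x_n) \in C^c$, we have

$$R(\theta, C) = \int_{C} \mathcal{L}(\theta, \theta'')L(\theta) + \int_{C^c} \mathcal{L}(\theta, \theta')L(\theta). \tag{8.5.1}$$

If, in Equation (8.5.1), we take $\theta = \theta'$, then $\mathcal{L}(\theta', \theta') = 0$ and hence

$$R(\theta', C) = \int_{C} \mathcal{L}(\theta', \theta'')L(\theta') = \mathcal{L}(\theta', \theta'')\int_{C} L(\theta').$$

On the other hand, if in Equation (8.5.1) we let $\theta = \theta''$, then $\mathcal{L}(\theta'', \theta'') = 0$ and,
accordingly,

$$R(\theta'', C) = \int_{C^c} \mathcal{L}(\theta'', \theta')L(\theta'') = \mathcal{L}(\theta'', \theta')\int_{C^c} L(\theta'').$$

It is enlightening to note that if $\gamma(\theta)$ is the power function of the test associated
with the critical region $C$, then

$$R(\theta', C) = \mathcal{L}(\theta', \theta'')\gamma(\theta') = \mathcal{L}(\theta', \theta'')\alpha,$$

where $\alpha = \gamma(\theta')$ is the significance level; and

$$R(\theta'', C) = \mathcal{L}(\theta'', \theta')[1 - \gamma(\theta'')] = \mathcal{L}(\theta'', \theta')\beta,$$

where $\beta = 1 - \gamma(\theta'')$ is the probability of the type II error.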
Let us now see if we can find a minimax solution to our problem. That is, we
want to find a critical region $C$ so that

$$\max[R(\theta', C), R(\theta'', C)]$$

is minimized. We shall show that the solution is the region

$$C = \left\{(x_1, \ldots, x_n) : \frac{L(\theta'; x_1, \ldots, x_n)}{L(\theta''; x_1, \ldots, x_n)} \leq k\right\},$$

provided the positive constant $k$ is selected so that $R(\theta', C) = R(\theta'', C)$. That is, if
$k$ is chosen so that

$$\mathcal{L}(\theta', \theta'')\int_{C} L(\theta') = \mathcal{L}(\theta'', \theta')\int_{C^c} L(\theta''),$$

then the critical region $C$ provides a minimax solution. In the case of random variables
of the continuous type, $k$ can always be selected so that $R(\theta', C) = R(\theta'', C)$.
However, with random variables of the discrete type, we may need to consider an
auxiliary random experiment when $L(\theta')/L(\theta'') = k$ in order to achieve the exact
equality $R(\theta', C) = R(\theta'', C)$.

To see that $C$ is the minimax solution, consider every other region $A$ for which
$R(\theta', C) \geq R(\theta', A)$. A region $A$ for which $R(\theta', C) < R(\theta', A)$ is not a candidate for
a minimax solution, for then $R(\theta', C) = R(\theta'', C) < \max[R(\theta', A), R(\theta'', A)]$. Since
$R(\theta', C) \geq R(\theta', A)$ means that

$$\mathcal{L}(\theta', \theta'')\int_{C} L(\theta') \geq \mathcal{L}(\theta', \theta'')\int_{A} L(\theta'),$$

we have

$$\alpha = \int_{C} L(\theta') \geq \int_{A} L(\theta');$$

that is, the significance level of the test associated with the critical region $A$ is less
than or equal to $\alpha$. But $C$, in accordance with the Neyman-Pearson theorem, is a
best critical region of size $\alpha$. Thus

$$\int_{C} L(\theta'') \geq \int_{A} L(\theta'')$$

and

$$\int_{C^c} L(\theta'') \leq \int_{A^c} L(\theta'').$$

Accordingly,

$$\mathcal{L}(\theta'', \theta')\int_{C^c} L(\theta'') \leq \mathcal{L}(\theta'', \theta')\int_{A^c} L(\theta''),$$

or, equivalently,

$$R(\theta'', C) \leq R(\theta'', A).$$

That is,

$$R(\theta', C) = R(\theta'', C) \leq R(\theta'', A).$$

This means that

$$\max[R(\theta', C), R(\theta'', C)] \leq R(\theta'', A).$$

Then certainly,

$$\max[R(\theta', C), R(\theta'', C)] \leq \max[R(\theta', A), R(\theta'', A)],$$

and the critical region $C$ provides a minimax solution, as we wanted to show.


Example 8.5.1. Let $X_1, X_2, \ldots, X_{100}$ denote a random sample of size 100 from
a distribution that is $N(\theta, 100)$. We again consider the problem of testing $H_0 :
\theta = 75$ against $H_1 : \theta = 78$. We seek a minimax solution with $\mathcal{L}(75, 78) = 3$ and
$\mathcal{L}(78, 75) = 1$. Since $L(75)/L(78) \leq k$ is equivalent to $\bar{x} \geq c$, we want to determine
$c$, and thus $k$, so that

$$3P(\bar{X} \geq c;\ \theta = 75) = P(\bar{X} < c;\ \theta = 78). \tag{8.5.2}$$

Because $\bar{X}$ is $N(\theta, 1)$, the preceding equation can be rewritten as

$$3[1 - \Phi(c - 75)] = \Phi(c - 78).$$

As requested in Exercise 8.5.4, the reader can show by using Newton's algorithm
that the solution to one place is $c = 76.8$. The significance level of the test is
$1 - \Phi(1.8) = 0.036$, approximately, and the power of the test when $H_1$ is true is
$1 - \Phi(-1.2) = 0.885$, approximately. $\blacksquare$
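
The value $c = 76.8$ can be checked numerically. The sketch below uses R's general-purpose root finder `uniroot` rather than the Newton algorithm requested in Exercise 8.5.4, so it is a check on the answer and not a solution of that exercise; the function name `g` and the search interval are assumptions made here.

```r
# Difference of the two sides of 3[1 - Phi(c - 75)] = Phi(c - 78)
g <- function(c) 3 * (1 - pnorm(c - 75)) - pnorm(c - 78)

# g(75) > 0 and g(78) < 0, so a root lies in the assumed interval (75, 78)
sol   <- uniroot(g, interval = c(75, 78), tol = 1e-8)
c_hat <- sol$root

round(c_hat, 1)            # 76.8
1 - pnorm(c_hat - 75)      # significance level at the unrounded root
1 - pnorm(c_hat - 78)      # power when theta = 78 at the unrounded root
```

The last two values differ slightly from the 0.036 and 0.885 of the example because the text evaluates them at the rounded value $c = 76.8$.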


8.5.2 Classification 


The summary above has an interesting application to the problem of classification, 
which can be described as follows. An investigator makes a number of measurements 
on an item and wants to place it into one of several categories (or classify it). 
For convenience in our discussion, we assume that only two measurements, say 
X and Y, are made on the item to be classified. Moreover, let X and Y have 
a joint pdf $f(x, y; \theta)$, where the parameter $\theta$ represents one or more parameters.
In our simplification, suppose that there are only two possible joint distributions
(categories) for $X$ and $Y$, which are indexed by the parameter values $\theta'$ and $\theta''$,
respectively. In this case, the problem then reduces to one of observing $X = x$ and
$Y = y$ and then testing the hypothesis $\theta = \theta'$ against the hypothesis $\theta = \theta''$, with
the classification of $X$ and $Y$ being in accord with which hypothesis is accepted.
From the Neyman-Pearson theorem, we know that a best decision of this sort is of
the following form: If

$$\frac{f(x, y; \theta')}{f(x, y; \theta'')} \leq k,$$

choose the distribution indexed by $\theta''$; that is, we classify $(x, y)$ as coming from the
distribution indexed by $\theta''$. Otherwise, choose the distribution indexed by $\theta'$; that
is, we classify $(x, y)$ as coming from the distribution indexed by $\theta'$. Some discussion
on the choice of $k$ follows in the next remark.


Remark 8.5.1 (On the Choice of $k$). Consider the following probabilities:

$$\pi' = P[(X, Y) \text{ is drawn from the distribution with pdf } f(x, y; \theta')]$$
$$\pi'' = P[(X, Y) \text{ is drawn from the distribution with pdf } f(x, y; \theta'')].$$

Note that $\pi' + \pi'' = 1$. Then it can be shown that the optimal classification rule
is determined by taking $k = \pi''/\pi'$; see, for instance, Seber (1984). Hence, if we
have prior information on how likely the item is drawn from the distribution with
parameter $\theta'$, then we can obtain the classification rule. In practice, it is common
for each distribution to be equilikely, in which case, $\pi' = \pi'' = 1/2$ and, hence,
$k = 1$. $\blacksquare$


Example 8.5.2. Let $(x, y)$ be an observation of the random pair $(X, Y)$, which has
a bivariate normal distribution with parameters $\mu_1, \mu_2, \sigma_1^2, \sigma_2^2$, and $\rho$. In Section 3.5
that joint pdf is given by

$$f(x, y; \mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}}\, e^{-q(x, y; \mu_1, \mu_2)/2},$$

for $-\infty < x < \infty$ and $-\infty < y < \infty$, where $\sigma_1 > 0$, $\sigma_2 > 0$, $-1 < \rho < 1$, and

$$q(x, y; \mu_1, \mu_2) = \frac{1}{1 - \rho^2}\left[\left(\frac{x - \mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x - \mu_1}{\sigma_1}\right)\left(\frac{y - \mu_2}{\sigma_2}\right) + \left(\frac{y - \mu_2}{\sigma_2}\right)^2\right].$$

Assume that $\sigma_1^2$, $\sigma_2^2$, and $\rho$ are known but that we do not know whether the respective
means of $(X, Y)$ are $(\mu_1', \mu_2')$ or $(\mu_1'', \mu_2'')$. The inequality

$$\frac{f(x, y; \mu_1', \mu_2', \sigma_1^2, \sigma_2^2, \rho)}{f(x, y; \mu_1'', \mu_2'', \sigma_1^2, \sigma_2^2, \rho)} \leq k$$

is equivalent to

$$\tfrac{1}{2}[q(x, y; \mu_1'', \mu_2'') - q(x, y; \mu_1', \mu_2')] \leq \log k.$$

Moreover, it is clear that the difference in the left-hand member of this inequality
does not contain terms involving $x^2$, $xy$, and $y^2$. In particular, this inequality is the
same as

$$\frac{1}{1 - \rho^2}\left\{\left[\frac{\mu_1' - \mu_1''}{\sigma_1^2} - \frac{\rho(\mu_2' - \mu_2'')}{\sigma_1\sigma_2}\right]x + \left[\frac{\mu_2' - \mu_2''}{\sigma_2^2} - \frac{\rho(\mu_1' - \mu_1'')}{\sigma_1\sigma_2}\right]y\right\}$$
$$\leq \log k + \tfrac{1}{2}[q(0, 0; \mu_1', \mu_2') - q(0, 0; \mu_1'', \mu_2'')],$$

or, for brevity,

$$ax + by \leq c. \tag{8.5.3}$$

That is, if this linear function of $x$ and $y$ in the left-hand member of inequality
(8.5.3) is less than or equal to a constant, we classify $(x, y)$ as coming from the
bivariate normal distribution with means $\mu_1''$ and $\mu_2''$. Otherwise, we classify $(x, y)$
as arising from the bivariate normal distribution with means $\mu_1'$ and $\mu_2'$. Of course,
if the prior probabilities can be assigned as discussed in Remark 8.5.1 then $k$ and
thus $c$ can be found easily; see Exercise 8.5.3. $\blacksquare$


Once the rule for classification is established, the statistician might be interested
in the two probabilities of misclassifications using that rule. The first of these two is
associated with the classification of $(x, y)$ as arising from the distribution indexed by
$\theta''$ if, in fact, it comes from the distribution indexed by $\theta'$. The second misclassification is similar,
but with the interchange of $\theta'$ and $\theta''$. In the preceding example, the probabilities
of these respective misclassifications are

$$P(aX + bY \leq c;\ \mu_1', \mu_2') \quad \text{and} \quad P(aX + bY > c;\ \mu_1'', \mu_2'').$$

The distribution of $Z = aX + bY$ is obtained from Theorem 3.5.2. It follows
that the distribution of $Z = aX + bY$ is given by

$$N(a\mu_1 + b\mu_2,\ a^2\sigma_1^2 + 2ab\rho\sigma_1\sigma_2 + b^2\sigma_2^2).$$

With this information, it is easy to compute the probabilities of misclassifications;
see Exercise 8.5.3.
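
The coefficients in (8.5.3) and the two misclassification probabilities can be computed as in the following R sketch. All parameter values and variable names here are hypothetical, chosen only to illustrate the formulas (they are not the values of Exercise 8.5.3), and $k = 1$ is assumed.

```r
# Hypothetical parameter values, used only to illustrate the formulas
mu_p  <- c(0, 0)          # (mu_1',  mu_2')
mu_pp <- c(2, 1)          # (mu_1'', mu_2'')
s1 <- 1; s2 <- 2; rho <- 0.3
k  <- 1                   # equal prior probabilities, as in Remark 8.5.1

# q(0, 0; mu_1, mu_2) from the bivariate normal exponent
q0 <- function(mu) {
  ((mu[1] / s1)^2 - 2 * rho * (mu[1] / s1) * (mu[2] / s2) + (mu[2] / s2)^2) / (1 - rho^2)
}

# Coefficients a, b and constant c of inequality (8.5.3)
a  <- ((mu_p[1] - mu_pp[1]) / s1^2 - rho * (mu_p[2] - mu_pp[2]) / (s1 * s2)) / (1 - rho^2)
b  <- ((mu_p[2] - mu_pp[2]) / s2^2 - rho * (mu_p[1] - mu_pp[1]) / (s1 * s2)) / (1 - rho^2)
cc <- log(k) + 0.5 * (q0(mu_p) - q0(mu_pp))

# Z = aX + bY is normal with this standard deviation under either set of means
sd_z <- sqrt(a^2 * s1^2 + 2 * a * b * rho * s1 * s2 + b^2 * s2^2)

p1 <- pnorm((cc - (a * mu_p[1]  + b * mu_p[2]))  / sd_z)      # P(aX + bY <= c; mu')
p2 <- 1 - pnorm((cc - (a * mu_pp[1] + b * mu_pp[2])) / sd_z)  # P(aX + bY >  c; mu'')
c(a = a, b = b, c = cc, p1 = p1, p2 = p2)
```

With $k = 1$ and a common covariance structure, the two misclassification probabilities coincide, which gives a simple sanity check on the computation.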

One final remark must be made with respect to the use of the important classification
rule established in Example 8.5.2. In most instances the parameter values
$\mu_1', \mu_2'$ and $\mu_1'', \mu_2''$ as well as $\sigma_1^2, \sigma_2^2$, and $\rho$ are unknown. In such cases the statistician
has usually observed a random sample (frequently called a training sample)
from each of the two distributions. Let us say the samples have sizes $n'$ and $n''$,
respectively, with sample characteristics

$$\bar{x}', \bar{y}', (s_x')^2, (s_y')^2, r' \quad \text{and} \quad \bar{x}'', \bar{y}'', (s_x'')^2, (s_y'')^2, r''.$$

The statistics $r'$ and $r''$ are the sample correlation coefficients, as defined in expression
(9.7.1) of Section 9.7. The sample correlation coefficient is the mle for the
correlation parameter $\rho$ of a bivariate normal distribution; see Section 9.7. If in
inequality (8.5.3) the parameters $\mu_1', \mu_2', \mu_1'', \mu_2'', \sigma_1^2, \sigma_2^2$, and $\rho\sigma_1\sigma_2$ are replaced by
the unbiased estimates

$$\bar{x}',\ \bar{y}',\ \bar{x}'',\ \bar{y}'',\ \frac{(n' - 1)(s_x')^2 + (n'' - 1)(s_x'')^2}{n' + n'' - 2},\ \frac{(n' - 1)(s_y')^2 + (n'' - 1)(s_y'')^2}{n' + n'' - 2},$$
$$\frac{(n' - 1)r's_x's_y' + (n'' - 1)r''s_x''s_y''}{n' + n'' - 2},$$

the resulting expression in the left-hand member is frequently called Fisher's linear
discriminant function. Since those parameters have been estimated, the
distribution theory associated with $aX + bY$ does provide an approximation.

Although we have considered only bivariate distributions in this section, the
results can easily be extended to multivariate normal distributions using the results
of Section 3.5; see also Chapter 6 of Seber (1984).




EXERCISES 


8.5.1. Let $X_1, X_2, \ldots, X_{20}$ be a random sample of size 20 from a distribution that
is $N(\theta, 5)$. Let $L(\theta)$ represent the joint pdf of $X_1, X_2, \ldots, X_{20}$. The problem is to
test $H_0 : \theta = 1$ against $H_1 : \theta = 0$. Thus $\Omega = \{\theta : \theta = 0, 1\}$.


(a) Show that $L(1)/L(0) \leq k$ is equivalent to $\bar{x} \leq c$.


(b) Find $c$ so that the significance level is $\alpha = 0.05$. Compute the power of this
test if $H_1$ is true.


(c) If the loss function is such that $\mathcal{L}(1, 1) = \mathcal{L}(0, 0) = 0$ and $\mathcal{L}(1, 0) = \mathcal{L}(0, 1) > 0$,
find the minimax test. Evaluate the power function of this test at the points
$\theta = 1$ and $\theta = 0$.


8.5.2. Let $X_1, X_2, \ldots, X_{10}$ be a random sample of size 10 from a Poisson distribution
with parameter $\theta$. Let $L(\theta)$ be the joint pdf of $X_1, X_2, \ldots, X_{10}$. The problem
is to test $H_0 : \theta = \frac{1}{2}$ against $H_1 : \theta = 1$.

(a) Show that $L(\frac{1}{2})/L(1) \leq k$ is equivalent to $y = \sum_{1}^{n} x_i \geq c$.


(b) In order to make $\alpha = 0.05$, show that $H_0$ is rejected if $y > 9$ and, if $y = 9$,
reject $H_0$ with probability $\frac{1}{2}$ (using some auxiliary random experiment).


(c) If the loss function is such that $\mathcal{L}(\frac{1}{2}, \frac{1}{2}) = \mathcal{L}(1, 1) = 0$ and $\mathcal{L}(\frac{1}{2}, 1) = 1$ and
$\mathcal{L}(1, \frac{1}{2}) = 2$, show that the minimax procedure is to reject $H_0$ if $y > 6$ and, if
$y = 6$, reject $H_0$ with probability 0.08 (using some auxiliary random experiment).


8.5.3. In Example 8.5.2 let $\mu_1' = \mu_2' = 0$, $\mu_1'' = \mu_2'' = 1$, $\sigma_1^2 = 1$, $\sigma_2^2 = 1$, and $\rho = \frac{1}{2}$.


(a) Find the distribution of the linear function $aX + bY$.


(b) With $k = 1$, compute $P(aX + bY \leq c;\ \mu_1' = \mu_2' = 0)$ and $P(aX + bY > c;\ \mu_1'' =
\mu_2'' = 1)$.


8.5.4. Determine Newton’s algorithm to find the solution of Equation (8.5.2). If 
software is available, write a program that performs your algorithm and then show 
that the solution is c = 76.8. If software is not available, solve (8.5.2) by “trial and 
error.” 


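8.5.5. Let $X$ and $Y$ have the joint pdf

$$f(x, y; \theta_1, \theta_2) = \frac{1}{\theta_1\theta_2}\exp\left(-\frac{x}{\theta_1} - \frac{y}{\theta_2}\right), \quad 0 < x < \infty,\ 0 < y < \infty,$$

zero elsewhere, where $0 < \theta_1$, $0 < \theta_2$. An observation $(x, y)$ arises from the joint
distribution with parameters equal to either $(\theta_1' = 1, \theta_2' = 5)$ or $(\theta_1'' = 3, \theta_2'' = 2)$.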
Determine the form of the classification rule. 

8.5.6. Let X and Y have a joint bivariate normal distribution. An observation
$(x, y)$ arises from the joint distribution with parameters equal to either

$$\mu_1' = \mu_2' = 0, \quad (\sigma_1^2)' = (\sigma_2^2)' = 1, \quad \rho' = \tfrac{1}{2}$$

or

$$\mu_1'' = \mu_2'' = 1, \quad (\sigma_1^2)'' = 4, \quad (\sigma_2^2)'' = 9, \quad \rho'' = \tfrac{1}{2}.$$

Show that the classification rule involves a second-degree polynomial in x and y.


8.5.7. Let $W' = (W_1, W_2)$ be an observation from one of two bivariate normal
distributions, I and II, each with $\mu_1 = \mu_2 = 0$ but with the respective variance-
covariance matrices

$$
V_1 = \begin{pmatrix} 1 & 0 \\ 0 & 4 \end{pmatrix}
\quad\text{and}\quad
V_2 = \begin{pmatrix} 3 & 0 \\ 0 & 12 \end{pmatrix}.
$$

How would you classify W into I or II?


Chapter 9 


Inferences About Normal 
Linear Models 


9.1 Introduction 


In this chapter, we consider analyses of some of the most widely used linear mod- 
els. These models include one- and two-way analysis of variance (ANOVA) models 
and regression and correlation models. We generally assume normally distributed 
random errors for these models. The inference procedures that we discuss are, for 
the most part, based on maximum likelihood procedures. The theory requires some 
discussion of quadratic forms which we briefly introduce next. 


Consider polynomials of degree 2 in n variables, $X_1, \ldots, X_n$, of the form

$$q(X_1, \ldots, X_n) = \sum_{i=1}^{n}\sum_{j=1}^{n} X_i a_{ij} X_j,$$

for $n^2$ constants $a_{ij}$. We call this form a quadratic form in the variables $X_1, \ldots, X_n$.


If both the variables and the coefficients are real, it is called a real quadratic
form. Only real quadratic forms are considered in this book. To illustrate, the form
$X_1^2 + X_1X_2 + X_2^2$ is a quadratic form in the two variables $X_1$ and $X_2$; the form
$X_1^2 + X_2^2 + X_3^2 - 2X_1X_2$ is a quadratic form in the three variables $X_1$, $X_2$, and $X_3$;
but the form $(X_1 - 1)^2 + (X_2 - 2)^2 = X_1^2 + X_2^2 - 2X_1 - 4X_2 + 5$ is not a quadratic
form in $X_1$ and $X_2$, although it is a quadratic form in the variables $X_1 - 1$ and
$X_2 - 2$.


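Let $\bar{X}$ and $S^2$ denote, respectively, the mean and variance of a random sample
$X_1, X_2, \ldots, X_n$ from an arbitrary distribution. Thus

$$
\begin{aligned}
(n-1)S^2 &= \sum_{i=1}^{n}(X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \\
&= \sum_{i=1}^{n} X_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} X_i\right)\left(\sum_{j=1}^{n} X_j\right) \\
&= \sum_{i=1}^{n} X_i^2 - \frac{1}{n}\left[\sum_{i=1}^{n} X_i^2 + 2\sum_{i<j} X_iX_j\right] \\
&= \frac{n-1}{n}\sum_{i=1}^{n} X_i^2 - \frac{2}{n}\sum_{i<j} X_iX_j.
\end{aligned}
$$

So the sample variance is a quadratic form in the variables $X_1, \ldots, X_n$.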
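
As a quick numerical check of this identity, the following R sketch builds the coefficient matrix with $a_{ii} = (n-1)/n$ and $a_{ij} = -1/n$ for $i \ne j$, and compares the resulting quadratic form with $(n-1)s^2$. The sample values are arbitrary and are used only for illustration; they do not come from the text.

```r
# Arbitrary illustrative sample (an assumption, not data from the text)
x <- c(2.1, 3.4, 1.7, 5.0, 4.2)
n <- length(x)

# Coefficient matrix of the quadratic form: I - (1/n) J,
# i.e., (n-1)/n on the diagonal and -1/n off the diagonal
A <- diag(n) - matrix(1 / n, n, n)

qform  <- as.numeric(t(x) %*% A %*% x)   # sum_i sum_j x_i a_ij x_j
direct <- (n - 1) * var(x)               # (n-1) s^2

all.equal(qform, direct)
```

The two quantities should agree up to floating-point error for any numeric vector x of length at least 2.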


9.2 One-Way ANOVA 


Consider b independent random variables that have normal distributions with un-
known means $\mu_1, \mu_2, \ldots, \mu_b$, respectively, and unknown but common variance $\sigma^2$.
For each $j = 1, 2, \ldots, b$, let $X_{1j}, X_{2j}, \ldots, X_{n_j j}$ represent a random sample of size
$n_j$ from the normal distribution with mean $\mu_j$ and variance $\sigma^2$. The appropriate
model for the observations is

$$X_{ij} = \mu_j + e_{ij}, \quad i = 1, \ldots, n_j,\ j = 1, \ldots, b, \qquad (9.2.1)$$

where the $e_{ij}$ are iid $N(0, \sigma^2)$. Let $n = \sum_{j=1}^{b} n_j$ denote the total sample size. Suppose
that it is desired to test the composite hypothesis

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_b \ \text{ versus } \ H_1: \mu_j \ne \mu_{j'} \text{ for some } j \ne j'. \qquad (9.2.2)$$


We derive the likelihood ratio test for these hypotheses. 

Such problems often arise in practice. For example, suppose for a certain type
of disease there are b drugs that can be used to treat it and we are interested
in determining which drug is best in terms of a certain response. Let $X_j$ denote
this response when drug j is applied and let $\mu_j = E(X_j)$. If we assume that $X_j$
is $N(\mu_j, \sigma^2)$, then the above null hypothesis says that all the drugs are equally
effective; see Exercise 9.2.6 for a numerical illustration of this situation involving
drugs that are intended to lower cholesterol. In general, we often summarize this
problem by saying that we have one factor at b levels. In this case the factor is the
treatment of the disease and each level corresponds to one of the treatment drugs.

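Model (9.2.1) is called a one-way model. As shown below, the likelihood ratio test
can be thought of in terms of estimates of variance. Hence, this is an example of an
analysis of variance (ANOVA). In short, we say that this example is a one-way
ANOVA problem.
Here the full model parameter space is

$$\Omega = \{(\mu_1, \mu_2, \ldots, \mu_b, \sigma^2) : -\infty < \mu_j < \infty,\ j = 1, \ldots, b,\ 0 < \sigma^2 < \infty\},$$

while the reduced model (full model under $H_0$) parameter space is

$$\omega = \{(\mu_1, \mu_2, \ldots, \mu_b, \sigma^2) : -\infty < \mu_1 = \mu_2 = \cdots = \mu_b = \mu < \infty,\ 0 < \sigma^2 < \infty\}.$$

The likelihood functions, denoted by $L(\Omega)$ and $L(\omega)$ are, respectively,

$$L(\Omega) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left\{-\frac{1}{2\sigma^2}\sum_{j=1}^{b}\sum_{i=1}^{n_j}(x_{ij} - \mu_j)^2\right\}$$

and

$$L(\omega) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left\{-\frac{1}{2\sigma^2}\sum_{j=1}^{b}\sum_{i=1}^{n_j}(x_{ij} - \mu)^2\right\}.$$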

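We first consider the reduced model. Notice that it is just a one-sample model
with sample size n from a $N(\mu, \sigma^2)$ distribution. We have derived the mles in
Example 4.1.3 of Chapter 4, which, in this notation, are given by

$$\hat{\mu}_\omega = \frac{1}{n}\sum_{j=1}^{b}\sum_{i=1}^{n_j} x_{ij} = \bar{x}_{..} \quad\text{and}\quad \hat{\sigma}^2_\omega = \frac{1}{n}\sum_{j=1}^{b}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{..})^2. \qquad (9.2.3)$$

The notation $\bar{x}_{..}$ denotes that the mean is taken over both subscripts. This is often
called the grand mean. Evaluating $L(\omega)$ at the mles, we obtain after simplification:

$$L(\hat{\omega}) = \left(\frac{1}{2\pi}\right)^{n/2}\left(\frac{1}{\hat{\sigma}^2_\omega}\right)^{n/2} e^{-n/2}. \qquad (9.2.4)$$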


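Next, we consider the full model. The log of its likelihood is

$$\log L(\Omega) = -(n/2)\log(2\pi) - (n/2)\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{j=1}^{b}\sum_{i=1}^{n_j}(x_{ij} - \mu_j)^2. \qquad (9.2.5)$$

For $j = 1, \ldots, b$, the partial of the log of $L(\Omega)$ with respect to $\mu_j$ results in

$$\frac{\partial \log L(\Omega)}{\partial \mu_j} = \frac{1}{\sigma^2}\sum_{i=1}^{n_j}(x_{ij} - \mu_j).$$

Setting this partial to 0 and solving for $\mu_j$, we obtain the mle of $\mu_j$, which we denote
by

$$\hat{\mu}_j = \frac{1}{n_j}\sum_{i=1}^{n_j} x_{ij} = \bar{x}_{.j}, \quad j = 1, \ldots, b. \qquad (9.2.6)$$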
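Since this derivation did not depend on $\sigma$, to find the mle of $\sigma$, we substitute $\bar{x}_{.j}$
for $\mu_j$ in $\log L(\Omega)$. Taking the partial derivative with respect to $\sigma$ we then get

$$\frac{\partial \log L(\Omega)}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{j=1}^{b}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{.j})^2.$$

Setting this to 0 and solving for $\sigma^2$, we obtain the mle (see footnote 1)

$$\hat{\sigma}^2_\Omega = \frac{1}{n}\sum_{j=1}^{b}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{.j})^2. \qquad (9.2.7)$$

Substituting these mles for their respective parameters in $L(\Omega)$, after some simpli-
fication, leads to

$$L(\hat{\Omega}) = \left(\frac{1}{2\pi}\right)^{n/2}\left(\frac{1}{\hat{\sigma}^2_\Omega}\right)^{n/2} e^{-n/2}. \qquad (9.2.8)$$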


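Hence, the likelihood ratio test rejects $H_0$ in favor of $H_1$ for small values of the
statistic $\Lambda = L(\hat{\omega})/L(\hat{\Omega})$ or, equivalently, for large values of $\Lambda^{-2/n}$. We can express
this test statistic as a ratio of two quadratic forms $Q$ and $Q_3$ as

$$\Lambda^{-2/n} = \frac{\hat{\sigma}^2_\omega}{\hat{\sigma}^2_\Omega} = \frac{\sum_{j=1}^{b}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{..})^2}{\sum_{j=1}^{b}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{.j})^2} = \frac{Q}{Q_3}. \qquad (9.2.9)$$

In order to rewrite the test statistic in terms of an F-statistic, consider the identity
involving $Q$, $Q_3$, and another quadratic form $Q_4$ given by:

$$
\begin{aligned}
Q = \sum_{j=1}^{b}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{..})^2 &= \sum_{j=1}^{b}\sum_{i=1}^{n_j}\left[(x_{ij} - \bar{x}_{.j}) + (\bar{x}_{.j} - \bar{x}_{..})\right]^2 \\
&= \sum_{j=1}^{b}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{.j})^2 + \sum_{j=1}^{b} n_j(\bar{x}_{.j} - \bar{x}_{..})^2 \\
&\overset{\text{dfn}}{=} Q_3 + Q_4. \qquad (9.2.10)
\end{aligned}
$$

This derivation follows because the cross product term obtained on expanding the
square is 0. Using this identity, the test statistic $\Lambda^{-2/n}$ can be expressed as

$$\Lambda^{-2/n} = \frac{Q_3 + Q_4}{Q_3} = 1 + \frac{Q_4}{Q_3}.$$

As the final version, note that the test rejects $H_0$ if F is too large, where

$$F = \frac{Q_4/(b-1)}{Q_3/(n-b)}. \qquad (9.2.11)$$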

1 We are using the fact that the mle of $\sigma^2$ is the square of the mle of $\sigma$.

To complete the test, we need to determine the distribution of F under $H_0$.
First consider the sum of squares in the denominator, $Q_3$, which we write as:

$$Q_3/\sigma^2 = \sum_{j=1}^{b}\left\{\frac{1}{\sigma^2}\sum_{i=1}^{n_j}(X_{ij} - \bar{X}_{.j})^2\right\}.$$

Notice, since we are discussing distributions, we are now using random variable
notation. By Part (c) of Theorem 3.6.1, for $j = 1, \ldots, b$, the term within the braces
has a $\chi^2$-distribution with $n_j - 1$ degrees of freedom. Further, the samples are
independent so these $\chi^2$ random variables are independent. Hence, by Corollary
3.3.1, $Q_3/\sigma^2$ has a $\chi^2$-distribution with $\sum_{j=1}^{b}(n_j - 1) = n - b$ degrees of freedom.
By Part (b) of Theorem 3.6.1, the random variable $\bar{X}_{.j}$ is independent of the sum
of squares within the braces and further, by the independence of the samples, it
is independent of $Q_3$. Thus, all b sample means are independent of $Q_3$. Because
$\bar{X}_{..} = (1/n)\sum_{j=1}^{b} n_j\bar{X}_{.j}$, the grand mean $\bar{X}_{..}$ is a function of the b sample means,
so it must be independent of $Q_3$ also. Therefore, $Q_4$ is independent of $Q_3$. For the
distribution of the numerator sum of squares, write the identity (9.2.10) as

$$Q/\sigma^2 = Q_3/\sigma^2 + Q_4/\sigma^2.$$

For the left side, under $H_0$, $Q/\sigma^2$ has a $\chi^2$-distribution with $n - 1$ degrees of freedom.
On the right side $Q_3/\sigma^2$ has a $\chi^2$-distribution with $n - b$ degrees of freedom and it
is also independent of $Q_4/\sigma^2$. By equating the mgfs of both sides, it follows that
$Q_4/\sigma^2$ has a $\chi^2$-distribution with $(n - 1) - (n - b) = b - 1$ degrees of freedom.
Therefore, under $H_0$, the F test statistic, (9.2.11), has an F-distribution with $b - 1$
and $n - b$ degrees of freedom.

Suppose now that we wish to compute the power of the test of $H_0$ against $H_1$
when $H_0$ is false, that is, when we do not have $\mu_1 = \mu_2 = \cdots = \mu_b$. In Section
9.3 we show that under $H_1$, $Q_4/\sigma^2$ no longer has a $\chi^2(b - 1)$ distribution. Thus we
cannot use the (central) F-distribution to compute the power of the test when $H_1$ is
true. The problem is discussed in Section 9.3.

Next, based on a simple example, we illustrate the computation of the F-test 
using R. 


Example 9.2.1. Devore (2012), page 412, presents a data set where the response
is the elastic modulus for an alloy that is cast by one of three different casting
processes. The null hypothesis is that the mean of the elastic modulus is not affected
by the casting process. The data are:

[Table "Elastic Modulus" (observed values for each of the three casting processes)
is not legible here; the values are in the file elasticmod.rda.]

The data are in the file elasticmod.rda. The variable elasticmod contains the
response while the variable ind contains the casting method (1, 2, or 3). The R
code and results (test statistic F and the p-value) are:

oneway.test(elasticmod ~ ind, var.equal = TRUE)

F = 12.565, num df = 2, denom df = 19, p-value = 0.0003336

With such a low p-value, the null hypothesis would be rejected and we would con-
clude that the casting method does have an effect on the elastic modulus.


In this example, the experimenter would also be interested in the pairwise com- 
parisons of the casting methods. We consider this in Section 9.4. 
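
To connect the R call above with expression (9.2.11), the following sketch computes $Q_3$, $Q_4$, and F directly and compares the result with oneway.test. It uses simulated data, since the elastic modulus values are not reproduced here; the seed, group sizes, and means are assumptions made only for illustration.

```r
set.seed(20)                                  # simulated data, for illustration only
ind <- factor(rep(1:3, times = c(6, 8, 7)))   # b = 3 groups, unequal sizes
y   <- rnorm(length(ind), mean = c(10, 12, 11)[ind], sd = 2)

n <- length(y); b <- nlevels(ind)
nj     <- tapply(y, ind, length)              # group sizes n_j
ybar_j <- tapply(y, ind, mean)                # group means
ybar   <- mean(y)                             # grand mean

Q3 <- sum((y - ybar_j[ind])^2)                # within-group sum of squares
Q4 <- sum(nj * (ybar_j - ybar)^2)             # between-group sum of squares
Q  <- sum((y - ybar)^2)                       # total sum of squares

Fstat <- (Q4 / (b - 1)) / (Q3 / (n - b))      # expression (9.2.11)
pval  <- pf(Fstat, b - 1, n - b, lower.tail = FALSE)

fit <- oneway.test(y ~ ind, var.equal = TRUE) # built-in equal-variance F-test
c(by_hand = Fstat, oneway = unname(fit$statistic))
```

The identity (9.2.10) can be checked in the same session with all.equal(Q, Q3 + Q4).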


EXERCISES 


9.2.1. Consider the T-statistic that was derived through a likelihood ratio for test-
ing the equality of the means of two normal distributions having common variance
in Example 8.3.1. Show that $T^2$ is exactly the F-statistic of expression (9.2.11).


9.2.2. Under Model (9.2.1), show that the linear functions $X_{ij} - \bar{X}_{.j}$ and $\bar{X}_{.j} - \bar{X}_{..}$
are uncorrelated.

Hint: Recall the definition of $\bar{X}_{.j}$ and $\bar{X}_{..}$ and, without loss of generality, we can
let $E(X_{ij}) = 0$ for all i, j.


9.2.3. The following are observations associated with independent random sam-
ples from three normal distributions having equal variances and respective means
$\mu_1, \mu_2, \mu_3$.

      I      II     III
     0.5    2.1    3.0
     1.3    3.3    5.1
    -1.0    0.0    1.9
     1.8    2.3    2.4
            2.5    4.2
                   4.1

Using R or another statistical package, compute the F-statistic that is used to test
$H_0: \mu_1 = \mu_2 = \mu_3$.


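9.2.4. Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal distribution $N(\mu, \sigma^2)$.
Show that

$$\sum_{i=1}^{n}(X_i - \bar{X})^2 = \sum_{i=2}^{n}(X_i - \bar{X}')^2 + \frac{n-1}{n}(X_1 - \bar{X}')^2,$$

where $\bar{X} = \sum_{i=1}^{n} X_i/n$ and $\bar{X}' = \sum_{i=2}^{n} X_i/(n-1)$.

Hint: Replace $X_i - \bar{X}$ by $(X_i - \bar{X}') - (X_1 - \bar{X}')/n$. Show that $\sum_{i=2}^{n}(X_i - \bar{X}')^2/\sigma^2$
has a chi-square distribution with $n - 2$ degrees of freedom. Prove that the two
terms in the right-hand member are independent. What then is the distribution of

$$\frac{[(n-1)/n](X_1 - \bar{X}')^2}{\sigma^2}\ ?$$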

9.2.5. Using the notation of this section, assume that the means satisfy the con-
dition that $\mu = \mu_1 + (b-1)d = \mu_2 - d = \mu_3 - d = \cdots = \mu_b - d$. That is, the
last $b - 1$ means are equal but differ from the first mean $\mu_1$, provided that $d \ne 0$.
Let independent random samples of size a be taken from the b normal distributions
with common unknown variance $\sigma^2$.


(a) Show that the maximum likelihood estimators of $\mu$ and d are $\hat{\mu} = \bar{X}_{..}$ and

$$\hat{d} = \frac{\sum_{j=2}^{b}\bar{X}_{.j}/(b-1) - \bar{X}_{.1}}{b}.$$


(b) Using Exercise 9.2.4, find $Q_6$ and $Q_7 = c\hat{d}^2$ so that, when $d = 0$, $Q_7/\sigma^2$ is
$\chi^2(1)$ and

$$\sum_{i=1}^{a}\sum_{j=1}^{b}(X_{ij} - \bar{X}_{..})^2 = Q_3 + Q_6 + Q_7.$$


(c) Argue that the three terms in the right-hand member of part (b), once di-
vided by $\sigma^2$, are independent random variables with chi-square distributions,
provided that $d = 0$.


(d) The ratio $Q_7/(Q_3 + Q_6)$ times what constant has an F-distribution, provided
that $d = 0$? Note that this F is really the square of the two-sample T used to
test the equality of the mean of the first distribution and the common mean
of the other distributions, in which the last $b - 1$ samples are combined into
one.


9.2.6. On page 123 of their text, Kloke and McKean (2014) present the results of an 
experiment investigating 4 drugs (treatments) for their effect on lowering LDL (low 
density lipids) cholesterol. For the experimental design, 39 quail were randomly 
assigned to one of the 4 drugs. The drug was mixed in their food, but, other than 
this, the quail were all treated in the same way. After a specified period of time, 
the LDL level of each quail was determined. The first drug was a placebo, so the 
interest is to see if any other of the drugs resulted in lower LDL than the placebo. 
The data are in the file quailldl.rda. The first column of this matrix contains the
drug indicator (1 through 4) for the quail while the second column contains the LDL
level of that quail.


(a) Obtain comparison boxplots of LDL levels. Which drugs seem to result in 
lower LDL levels? Identify, by observation number, the outliers in the data. 


(b) Compute the F-test that all mean levels of LDL are the same for all 4 drugs. 
Report the F-test statistic and p-value. Conclude in terms of the problem 
using the nominal significance level of 0.05. Use the R code in Example 9.2.1. 


(c) Does your conclusion in Part (b) agree with the boxplots of Part (a)? 

(d) Note that one assumption for the F-test is that the random errors $e_{ij}$ in Model
(9.2.1) are normally distributed. An estimate of $e_{ij}$ is $x_{ij} - \bar{x}_{.j}$. These are
called residuals, i.e., what is left after the full model fit. Compute these
residuals and then obtain a histogram, a boxplot, and a normal q-q plot of
them. Comment on the normality assumption. Use the code:

resd <- lm(quailmat[,2] ~ factor(quailmat[,1]))$resid
par(mfrow=c(2,2)); hist(resd); boxplot(resd); qqnorm(resd)


9.2.7. Let $\mu_1, \mu_2, \mu_3$ be, respectively, the means of three normal distributions with
a common but unknown variance $\sigma^2$. In order to test, at the $\alpha = 5\%$ significance
level, the hypothesis $H_0: \mu_1 = \mu_2 = \mu_3$ against all possible alternative hypotheses,
we take an independent random sample of size 4 from each of these distributions.
Determine whether we accept or reject $H_0$ if the observed values from these three
distributions are, respectively,

    X1:    5    9    6    8
    X2:   11   13   10   12
    X3:   10    6    9    9


9.2.8. The driver of a diesel-powered automobile decided to test the quality of three 
types of diesel fuel sold in the area based on mpg. Test the null hypothesis that the 
three means are equal using the following data. Make the usual assumptions and
take $\alpha = 0.05$.


Brand A: 38.7 39.2 40.1 38.9 
Brand B: 41.9 42.3 41.3 
Brand C: 40.8 41.2 39.5 38.9 40.3 


9.3 Noncentral $\chi^2$ and F-Distributions


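Let $X_1, X_2, \ldots, X_n$ denote independent random variables that are $N(\mu_i, \sigma^2)$, $i =
1, 2, \ldots, n$, and consider the quadratic form $Y = \sum_{i=1}^{n} X_i^2/\sigma^2$. If each $\mu_i$ is zero, we
know that Y is $\chi^2(n)$. We shall now investigate the distribution of Y when each $\mu_i$
is not zero. The mgf of Y is given by

$$M(t) = E\left[\exp\left(t\sum_{i=1}^{n}\frac{X_i^2}{\sigma^2}\right)\right] = \prod_{i=1}^{n} E\left[\exp\left(\frac{tX_i^2}{\sigma^2}\right)\right].$$

Consider

$$E\left[\exp\left(\frac{tX_i^2}{\sigma^2}\right)\right] = \int_{-\infty}^{\infty}\frac{1}{\sigma\sqrt{2\pi}}\exp\left[\frac{tx_i^2}{\sigma^2} - \frac{(x_i - \mu_i)^2}{2\sigma^2}\right]dx_i.$$

The integral exists if $t < \frac{1}{2}$. To evaluate the integral, note that

$$
\begin{aligned}
\frac{tx_i^2}{\sigma^2} - \frac{(x_i - \mu_i)^2}{2\sigma^2} &= -\frac{x_i^2(1-2t)}{2\sigma^2} + \frac{2x_i\mu_i}{2\sigma^2} - \frac{\mu_i^2}{2\sigma^2} \\
&= \frac{t\mu_i^2}{\sigma^2(1-2t)} - \frac{1-2t}{2\sigma^2}\left(x_i - \frac{\mu_i}{1-2t}\right)^2.
\end{aligned}
$$

Accordingly, with $t < \frac{1}{2}$, we have

$$E\left[\exp\left(\frac{tX_i^2}{\sigma^2}\right)\right] = \exp\left[\frac{t\mu_i^2}{\sigma^2(1-2t)}\right]\int_{-\infty}^{\infty}\frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{1-2t}{2\sigma^2}\left(x_i - \frac{\mu_i}{1-2t}\right)^2\right]dx_i.$$

If we multiply the integrand by $\sqrt{1-2t}$, $t < \frac{1}{2}$, we have the integral of a normal
pdf with mean $\mu_i/(1-2t)$ and variance $\sigma^2/(1-2t)$. Thus

$$E\left[\exp\left(\frac{tX_i^2}{\sigma^2}\right)\right] = \frac{1}{\sqrt{1-2t}}\exp\left[\frac{t\mu_i^2}{\sigma^2(1-2t)}\right],$$

and the mgf of $Y = \sum_{i=1}^{n} X_i^2/\sigma^2$ is given by

$$M(t) = \frac{1}{(1-2t)^{n/2}}\exp\left[\frac{t\sum_{i=1}^{n}\mu_i^2}{\sigma^2(1-2t)}\right], \quad t < \frac{1}{2}. \qquad (9.3.1)$$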
A random variable that has the mgf

$$M(t) = \frac{1}{(1-2t)^{r/2}}\,e^{t\theta/(1-2t)}, \qquad (9.3.2)$$

where $t < \frac{1}{2}$, $0 < \theta$, and r is a positive integer, is said to have a noncentral
chi-square distribution with r degrees of freedom and noncentrality parameter
$\theta$. If one sets the noncentrality parameter $\theta = 0$, one has $M(t) = (1-2t)^{-r/2}$,
which is the mgf of a random variable that is $\chi^2(r)$. Such a random variable can
appropriately be called a central chi-square variable. We shall use the symbol
$\chi^2(r, \theta)$ to denote a noncentral chi-square distribution that has the parameters r
and $\theta$; and we shall say that a random variable is $\chi^2(r, \theta)$ when that random variable
has this kind of distribution. The symbol $\chi^2(r, 0)$ is equivalent to $\chi^2(r)$. Thus our
random variable $Y = \sum_{i=1}^{n} X_i^2/\sigma^2$ of this section is $\chi^2\left(n, \sum_{i=1}^{n}\mu_i^2/\sigma^2\right)$. The mean of
Y is given by

$$E(Y) = \frac{1}{\sigma^2}\sum_{i=1}^{n} E(X_i^2) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(\sigma^2 + \mu_i^2) = n + \theta, \qquad (9.3.3)$$

i.e., the mean of the central $\chi^2$ plus the noncentrality parameter. If each $\mu_i$ is equal
to zero, then Y is $\chi^2(n, 0)$ or, more simply, Y is $\chi^2(n)$ with mean n.

The noncentral $\chi^2$-variables in which we have interest are certain quadratic
forms in normally distributed variables divided by a variance $\sigma^2$. In our exam-
ple it is worth noting that the noncentrality parameter of $\sum_{i=1}^{n} X_i^2/\sigma^2$, which is
$\sum_{i=1}^{n}\mu_i^2/\sigma^2$, may be computed by replacing each $X_i$ in the quadratic form by its
mean $\mu_i$, $i = 1, 2, \ldots, n$. This is no fortuitous circumstance; any quadratic form
$Q = Q(X_1, \ldots, X_n)$ in normally distributed variables, which is such that $Q/\sigma^2$ is
$\chi^2(r, \theta)$, has $\theta = Q(\mu_1, \mu_2, \ldots, \mu_n)/\sigma^2$; and if $Q/\sigma^2$ is a chi-square variable (central
or noncentral) for certain real values of $\mu_1, \mu_2, \ldots, \mu_n$, it is chi-square (central or
noncentral) for all real values of these means.
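
Expressions (9.3.2) and (9.3.3) can be checked numerically with the noncentral chi-square density that R provides through the ncp argument of dchisq. The sketch below is only an illustration: the values of r, theta, and the evaluation point t0 are hypothetical, and the finite upper integration limit of 200 is an assumption that is adequate here because the density is negligible beyond it.

```r
r     <- 5      # degrees of freedom (hypothetical)
theta <- 3      # noncentrality parameter (hypothetical)

# Mean by numerical integration of y * f(y); expression (9.3.3) gives r + theta
m1 <- integrate(function(y) y * dchisq(y, df = r, ncp = theta), 0, 200)$value

# mgf at t0 = 0.1 by numerical integration, versus the closed form (9.3.2)
t0      <- 0.1
mgf_num <- integrate(function(y) exp(t0 * y) * dchisq(y, df = r, ncp = theta),
                     0, 200)$value
mgf_cf  <- (1 - 2 * t0)^(-r / 2) * exp(t0 * theta / (1 - 2 * t0))

c(mean = m1, r_plus_theta = r + theta, mgf_numeric = mgf_num, mgf_closed = mgf_cf)
```

The first two entries of the result should agree, as should the last two, up to the accuracy of the numerical integration.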

We next discuss the noncentral F-distribution. If U and V are independent and
are, respectively, $\chi^2(r_1)$ and $\chi^2(r_2)$, the random variable F has been defined by
$F = r_2U/(r_1V)$. Now suppose, in particular, that U is $\chi^2(r_1, \theta)$, V is $\chi^2(r_2)$, and
U and V are independent. The distribution of the random variable $r_2U/(r_1V)$ is
called a noncentral F-distribution with $r_1$ and $r_2$ degrees of freedom with non-
centrality parameter $\theta$. Note that the noncentrality parameter of F is precisely the
noncentrality parameter of the random variable U, which is $\chi^2(r_1, \theta)$. To obtain the
expectation of F, use the E(U) in expression (9.3.3) and the derivation of the ex-
pected value of a central F given in expression (3.6.8). These together immediately
imply that

$$E(F) = \frac{r_2}{r_2 - 2}\left[\frac{r_1 + \theta}{r_1}\right], \qquad (9.3.4)$$

provided, of course, that $r_2 > 2$. If $\theta > 0$ then the quantity in brackets exceeds one
and, hence, the mean of the noncentral F exceeds the mean of the corresponding
central F.

We next discuss the noncentral F distribution for the one-way ANOVA of the
last section.


Example 9.3.1 (Noncentrality Parameter for One-way ANOVA). Consider the
one-way model with b levels, expression (9.2.1), with the hypotheses $H_0: \mu_1 =
\cdots = \mu_b$ versus $H_1: \mu_j \ne \mu_{j'}$ for some $j \ne j'$. From expression (9.2.11), the F test
statistic is $F = [Q_4/(b-1)]/[Q_3/(n-b)]$. In the denominator, the random variable
$Q_3/\sigma^2$ is $\chi^2(n - b)$ under the full model and, hence, in particular, under $H_1$. It
follows from Remark 9.8.3 of Section 9.8, though, that the distribution of $Q_4/\sigma^2$ is
noncentral $\chi^2(b - 1, \theta)$ under the full model. Recall that

$$Q_4/\sigma^2 = \frac{1}{\sigma^2}\sum_{j=1}^{b} n_j(\bar{X}_{.j} - \bar{X}_{..})^2.$$

Under the full model, $E(\bar{X}_{.j}) = \mu_j$ and $E(\bar{X}_{..}) = \sum_{j=1}^{b}(n_j/n)\mu_j$. Calling this last
expectation $\bar{\mu}$, we have from the above discussion that

$$\theta = \frac{1}{\sigma^2}\sum_{j=1}^{b} n_j(\mu_j - \bar{\mu})^2. \qquad (9.3.5)$$

If $H_0$ is true then $\mu_j = \mu$, for some $\mu$, and, hence, $\bar{\mu} = \mu$. Thus, under $H_0$, $\theta = 0$.
Under $H_1$, there are distinct j and j' such that $\mu_j \ne \mu_{j'}$. In particular, then both
$\mu_j$ and $\mu_{j'}$ cannot equal $\bar{\mu}$, so $\theta > 0$. Therefore, under $H_1$ the expectation of F
exceeds the null expectation.

There are R commands that compute the cdf of noncentral $\chi^2$ and F random
variables. For example, suppose we want to compute $P(Y \le y)$, where Y has
a $\chi^2$-distribution with d degrees of freedom and noncentrality parameter b. This
probability is returned with the command pchisq(y,d,b). The corresponding value
of the pdf at y is computed by the command dchisq(y,d,b). As another exam-
ple, suppose we want $P(W \ge w)$, where W has an F-distribution with n1 and n2
degrees of freedom and noncentrality parameter theta. This is computed by the
command 1-pf(w,n1,n2,theta), while the command df(w,n1,n2,theta) com-
putes the value of the density of W at w. Tables of the noncentral chi-square and
noncentral F-distributions are available in the literature also.
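
As an illustration of how these commands combine with expression (9.3.5), the following sketch computes the power of the level-0.05 F-test for a balanced one-way design. All numerical inputs (the number of levels, the per-level sample size, the pattern of means, and the variance) are hypothetical values chosen for the sketch, not values from the text.

```r
b      <- 3                 # number of levels (hypothetical design)
m      <- 8                 # common sample size per level (hypothetical)
mu     <- c(0, 1, 2)        # assumed pattern of means under H1
sigma2 <- 4                 # assumed common variance
alpha  <- 0.05

n     <- b * m
mubar <- mean(mu)                              # equal n_j, so the weights n_j/n are 1/b
theta <- m * sum((mu - mubar)^2) / sigma2      # expression (9.3.5)

crit  <- qf(1 - alpha, b - 1, n - b)           # central F critical value
power <- 1 - pf(crit, b - 1, n - b, ncp = theta)

c(theta = theta, critical_value = crit, power = power)
```

The critical value comes from the central F-distribution because the level is set under $H_0$; only the power calculation uses the noncentrality parameter.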


EXERCISES 


9.3.1. Let $Y_i$, $i = 1, 2, \ldots, n$, denote independent random variables that are, re-
spectively, $\chi^2(r_i, \theta_i)$, $i = 1, 2, \ldots, n$. Prove that $Z = \sum_{i=1}^{n} Y_i$ is $\chi^2\left(\sum_{i=1}^{n} r_i, \sum_{i=1}^{n}\theta_i\right)$.


9.3.2. Compute the variance of a random variable that is $\chi^2(r, \theta)$.


9.3.3. Three different medical procedures (A, B, and C) for a certain disease are
under investigation. For the study, 3m patients having this disease are to be selected
and m are to be assigned to each procedure. This common sample size m must be
determined. Let $\mu_1$, $\mu_2$, and $\mu_3$ be the means of the response of interest under
treatments A, B, and C, respectively. The hypotheses are: $H_0: \mu_1 = \mu_2 = \mu_3$
versus $H_1: \mu_j \ne \mu_{j'}$ for some $j \ne j'$. To determine m, from a pilot study the
experimenters use a guess of 30 for $\sigma^2$ and they select the significance level of 0.05.
They are interested in detecting the pattern of means: $\mu_2 = \mu_1 + 5$ and $\mu_3 = \mu_1 + 10$.


(a) Determine the noncentrality parameter under the above pattern of means. 


(b) Use the R function pf to determine the powers of the F-test to detect the 
above pattern of means for m = 5 and m = 10. 


(c) Determine the smallest value of m so that the power of detection is at least 
0.80. 


(d) Answer (a)-(c) if $\sigma^2 = 40$.


9.3.4. Show that the square of a noncentral T random variable is a noncentral F 
random variable. 


9.3.5. Let $X_1$ and $X_2$ be two independent random variables. Let $X_1$ and $Y =
X_1 + X_2$ be $\chi^2(r_1, \theta_1)$ and $\chi^2(r, \theta)$, respectively. Here $r_1 < r$ and $\theta_1 \le \theta$. Show that
$X_2$ is $\chi^2(r - r_1, \theta - \theta_1)$.


9.4 Multiple Comparisons 


For this section, consider the one-way ANOVA model with b treatments as de-
scribed in expression (9.2.1) of Section 9.2. In that section, we developed the F-test
of the hypotheses of equal means, (9.2.2). In practice, besides this test, statisticians
usually want to make pairwise comparisons of the form $\mu_j - \mu_{j'}$. This is often called
the Second Stage Analysis, while the F-test is considered the First Stage Anal-
ysis. The analysis for such comparisons usually consists of confidence intervals for
the differences $\mu_j - \mu_{j'}$, and $\mu_j$ is declared different from $\mu_{j'}$ if 0 is not in the
confidence interval. The random samples for treatments j and j' are $X_{1j}, \ldots, X_{n_j j}$
from the $N(\mu_j, \sigma^2)$ distribution and $X_{1j'}, \ldots, X_{n_{j'} j'}$ from the $N(\mu_{j'}, \sigma^2)$ distribu-
tion, which are independent random samples. Based on these samples the estimator
of $\mu_j - \mu_{j'}$ is $\bar{X}_{.j} - \bar{X}_{.j'}$. Further, in the one-way analysis, an estimator of $\sigma^2$ is the
full model estimator $\hat{\sigma}^2_\Omega$ defined in expression (9.2.7), taken here with the divisor
$n - b$ in place of n, i.e., $\hat{\sigma}^2_\Omega = Q_3/(n - b)$. As discussed in Section 9.2,
$(n - b)\hat{\sigma}^2_\Omega/\sigma^2$ has a $\chi^2(n - b)$ distribution which is independent of all the sample
means $\bar{X}_{.j}$. Hence, for a specified $\alpha$ it follows as in (4.2.13) of Chapter 4 that

$$\bar{X}_{.j} - \bar{X}_{.j'} \pm t_{\alpha/2,\,n-b}\,\hat{\sigma}_\Omega\sqrt{\frac{1}{n_j} + \frac{1}{n_{j'}}} \qquad (9.4.1)$$

is a $(1 - \alpha)100\%$ confidence interval for $\mu_j - \mu_{j'}$.

We often want to make many pairwise comparisons, though. For example, the
first treatment might be a placebo or represent the standard treatment. In this case,
there are $b - 1$ pairwise comparisons of interest. On the other hand, we may want
to make all $\binom{b}{2}$ pairwise comparisons. In making so many comparisons, while each
confidence interval, (9.4.1), has confidence $(1 - \alpha)$, it would seem that the overall
confidence diminishes. As we next show, this slippage of overall confidence is true.
These problems are often called Multiple Comparison Problems (MCP). In
this section, we present several MCP procedures.


Bonferroni Multiple Comparison Procedure 


It is easy to motivate the Bonferroni Procedure while, at the same time, showing
the slippage of confidence. This procedure is quite general and can be used in many
settings, not just the one-way design. So suppose we have k parameters $\theta_i$ with
$(1 - \alpha)100\%$ confidence intervals $I_i$, $i = 1, \ldots, k$, where $0 < \alpha < 1$ is given. Then
the overall confidence is $P(\theta_1 \in I_1, \ldots, \theta_k \in I_k)$. Using the method of complements,
DeMorgan’s Laws, and Boole’s inequality, expression (1.3.7) of Chapter 1, we have

$$
\begin{aligned}
P(\theta_1 \in I_1, \ldots, \theta_k \in I_k) &= 1 - P\left(\cup_{i=1}^{k}\{\theta_i \notin I_i\}\right) \\
&\ge 1 - \sum_{i=1}^{k} P(\theta_i \notin I_i) = 1 - k\alpha. \qquad (9.4.2)
\end{aligned}
$$

The quantity $1 - k\alpha$ is the lower bound on the slippage of confidence. For example,
if $k = 20$ and $\alpha = 0.05$ then the overall confidence may be 0. The Bonferroni
procedure follows from expression (9.4.2). Simply change the confidence level of
each confidence interval to $[1 - (\alpha/k)]$. Then the overall confidence is at least $1 - \alpha$.
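
The arithmetic behind this slippage can be illustrated with a short R sketch. The lower bound (9.4.2) holds in general; the exact overall confidence computed below additionally assumes that the k intervals are independent, which is an assumption made only for this illustration and need not hold in practice.

```r
k     <- 20
alpha <- 0.05

# Lower bound (9.4.2) on the overall confidence with unadjusted intervals
bound_unadj <- 1 - k * alpha

# Overall confidence in the special case of independent intervals (an assumption)
indep_unadj <- (1 - alpha)^k

# Bonferroni adjustment: each interval at level 1 - alpha/k
bound_bonf <- 1 - k * (alpha / k)
indep_bonf <- (1 - alpha / k)^k

round(c(bound_unadj = bound_unadj, indep_unadj = indep_unadj,
        bound_bonf = bound_bonf, indep_bonf = indep_bonf), 4)
```

Under independence the unadjusted overall confidence falls well below 0.95, while the Bonferroni-adjusted intervals keep it at or above 0.95, in line with the bound.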

For our one-way analysis, suppose we have k differences of interest. Then the
Bonferroni confidence interval for $\mu_j - \mu_{j'}$ is

$$\bar{X}_{.j} - \bar{X}_{.j'} \pm t_{\alpha/(2k),\,n-b}\,\hat{\sigma}_\Omega\sqrt{\frac{1}{n_j} + \frac{1}{n_{j'}}}. \qquad (9.4.3)$$

While the overall confidence of the Bonferroni procedure is at least $(1 - \alpha)$, for a
large number of comparisons, the lengths of its intervals are wide; i.e., there is a loss
in precision. We offer two other procedures that, generally, lessen this effect.

The R function mcpbon.R (see footnote 2) computes the Bonferroni procedure for all
pairwise comparisons for a one-way design. The call is mcpbon(y,ind,alpha=0.05) where
y is the vector of the combined samples and ind is the corresponding treatment
vector. See Example 9.4.1 below.


Tukey’s Multiple Comparison Procedure 


To state Tukey’s procedure, we first need to define the Studentized range distri- 
bution. 


Definition 9.4.1. Let $Y_1, \ldots, Y_k$ be iid $N(\mu, \sigma^2)$. Denote the range of these vari-
ables by $R = \max\{Y_i\} - \min\{Y_i\}$. Suppose $mS^2/\sigma^2$ has a $\chi^2(m)$ distribution which
is independent of $Y_1, \ldots, Y_k$. Then we say that $Q = R/S$ has a Studentized range
distribution with parameters k and m.


The distribution of Q cannot be obtained in closed form but packages such as R
have functions that compute the cdf and quantiles. In R, the call ptukey(x,k,m)
computes the cdf of Q at x, while the call qtukey(p,k,m) returns the pth quantile.

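Consider the one-way design. First, assume that all the sample sizes are the
same; i.e., for some positive integer a, $n_j = a$, for all $j = 1, \ldots, b$. Let $R =
\mathrm{Range}\{\bar{X}_{.1} - \mu_1, \ldots, \bar{X}_{.b} - \mu_b\}$. Then since $\bar{X}_{.1} - \mu_1, \ldots, \bar{X}_{.b} - \mu_b$ are iid $N(0, \sigma^2/a)$,
the random variable $Q = R/(\hat{\sigma}_\Omega/\sqrt{a})$ has a Studentized range distribution with
parameters b and $n - b$. Let $q_c = q_{1-\alpha,b,n-b}$. Then

$$
\begin{aligned}
1 - \alpha = P(Q \le q_c) &= P\left(\max\{\bar{X}_{.j} - \mu_j\} - \min\{\bar{X}_{.j} - \mu_j\} \le q_c\hat{\sigma}_\Omega/\sqrt{a}\right) \\
&= P\left(|(\mu_j - \mu_{j'}) - (\bar{X}_{.j} - \bar{X}_{.j'})| \le q_c\hat{\sigma}_\Omega/\sqrt{a}, \text{ for all } j, j'\right).
\end{aligned}
$$

If we expand the inequality in the last statement, we obtain the $(1 - \alpha)100\%$ simul-
taneous confidence intervals for all pairwise differences given by

$$\bar{X}_{.j} - \bar{X}_{.j'} \pm q_{1-\alpha,b,n-b}\,\frac{\hat{\sigma}_\Omega}{\sqrt{a}}, \quad \text{for all } j, j' \text{ in } 1, \ldots, b. \qquad (9.4.4)$$

The statistician John Tukey developed these simultaneous confidence intervals for
the balanced case. For the unbalanced case, first write the error term in (9.4.4) as

$$\frac{q_{1-\alpha,b,n-b}}{\sqrt{2}}\,\hat{\sigma}_\Omega\sqrt{\frac{1}{a} + \frac{1}{a}}.$$

For the unbalanced case, this suggests the following intervals

$$\bar{X}_{.j} - \bar{X}_{.j'} \pm \frac{q_{1-\alpha,b,n-b}}{\sqrt{2}}\,\hat{\sigma}_\Omega\sqrt{\frac{1}{n_j} + \frac{1}{n_{j'}}}, \quad \text{for all } j, j' \text{ in } 1, \ldots, b. \qquad (9.4.5)$$


2 Downloadable at the site listed in the Preface.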

This correction is due to Kramer and these intervals are often referred to as the 
Tukey-Kramer multiple comparison procedure; see Miller (1981) for discussion. 
These intervals do not have exact confidence (1 — a) but studies have indicated 
that if the unbalance is not severe the confidence is close to (1 — a); see Dunnett 
(1980). Corresponding R code is shown in Example 9.4.1. 
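
The interval (9.4.5) can also be computed directly with qtukey. The sketch below does this for one pair of treatments and compares the result with the output of R's TukeyHSD, which applies the Tukey-Kramer adjustment when the group sizes are unequal. The data are simulated; the seed, group labels, sizes, and means are assumptions made only for illustration, and the pooled standard deviation uses the divisor $n - b$.

```r
set.seed(94)                                        # simulated data, for illustration only
g <- factor(rep(c("A", "B", "C"), times = c(5, 7, 6)))
y <- rnorm(length(g), mean = c(10, 13, 11)[g], sd = 2)

n <- length(y); b <- nlevels(g); alpha <- 0.05
nj   <- tapply(y, g, length)
ybar <- tapply(y, g, mean)
s    <- sqrt(sum((y - ybar[g])^2) / (n - b))        # pooled sd, divisor n - b

qc <- qtukey(1 - alpha, b, n - b)                   # Studentized range quantile

# Tukey-Kramer interval (9.4.5) for mu_B - mu_A
d    <- unname(ybar["B"] - ybar["A"])
half <- unname((qc / sqrt(2)) * s * sqrt(1 / nj["A"] + 1 / nj["B"]))
by_hand <- c(diff = d, lwr = d - half, upr = d + half)

hsd <- TukeyHSD(aov(y ~ g), conf.level = 1 - alpha)$g
rbind(by_hand = by_hand, TukeyHSD = hsd["B-A", c("diff", "lwr", "upr")])
```

The two rows of the printed matrix should coincide up to rounding.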


Fisher’s PLSD Multiple Comparison Procedure 


The final procedure we discuss is Fisher’s Protected Least Significance Dif-
ference (PLSD). The setting is the general (unbalanced) one-way design (9.2.1).
This procedure is a two-stage procedure. It can be used for an arbitrary number of
comparisons but we state it for all comparisons. For a specified level of significance
$\alpha$, Stage 1 consists of the F-test of the hypotheses of equal means, (9.2.2). If the test
rejects at level $\alpha$ then Stage 2 consists of the usual pairwise $(1 - \alpha)100\%$ confidence
intervals, i.e.,

$$\bar{X}_{.j} - \bar{X}_{.j'} \pm t_{\alpha/2,\,n-b}\,\hat{\sigma}_\Omega\sqrt{\frac{1}{n_j} + \frac{1}{n_{j'}}}, \quad \text{for all } j, j' \text{ in } 1, \ldots, b. \qquad (9.4.6)$$

If the test in Stage 1 fails to reject, users sometimes perform Stage 2 using the
Bonferroni procedure. Fisher’s procedure does not have overall coverage $1 - \alpha$, but
the initial F-test offers protection. Simulation studies have shown that Fisher’s
procedure performs well in terms of power and level; see, for instance, Carmer and
Swanson (1973) and McKean et al. (1989). The R function mcpfisher.R (see footnote 3)
computes this procedure as discussed in the next example.
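
Before turning to that example, the two stages can be sketched in a few lines of R. The function fisher_plsd below is a hypothetical helper written for this illustration; it is not the text's mcpfisher function, and its name, arguments, and return value are assumptions. It forms the intervals (9.4.6) only when the Stage 1 F-test rejects, and it uses the divisor $n - b$ for the variance estimate. The data are simulated.

```r
# Hypothetical helper (not the text's mcpfisher): Fisher's PLSD for a one-way design
fisher_plsd <- function(y, g, alpha = 0.05) {
  g <- factor(g)
  n <- length(y); b <- nlevels(g)
  nj   <- tapply(y, g, length)
  ybar <- tapply(y, g, mean)
  mse  <- sum((y - ybar[g])^2) / (n - b)            # Q3 / (n - b)

  # Stage 1: overall F-test of equal means
  Fstat <- (sum(nj * (ybar - mean(y))^2) / (b - 1)) / mse
  pval  <- pf(Fstat, b - 1, n - b, lower.tail = FALSE)

  # Stage 2: unadjusted pairwise t intervals, formed only if Stage 1 rejects
  ci <- NULL
  if (pval < alpha) {
    pairs <- t(combn(b, 2))                          # all (j, j') with j < j'
    j <- pairs[, 1]; jp <- pairs[, 2]
    d    <- ybar[j] - ybar[jp]
    half <- qt(1 - alpha / 2, n - b) * sqrt(mse * (1 / nj[j] + 1 / nj[jp]))
    ci <- data.frame(pair = paste(levels(g)[j], levels(g)[jp], sep = " - "),
                     diff = as.numeric(d),
                     lwr  = as.numeric(d - half),
                     upr  = as.numeric(d + half))
  }
  list(F = Fstat, p.value = pval, intervals = ci)
}

set.seed(946)                                        # simulated data, for illustration only
g <- rep(c("A", "B", "C"), times = c(6, 6, 6))
y <- rnorm(18, mean = rep(c(10, 15, 11), each = 6), sd = 1)
out <- fisher_plsd(y, g)
out
```

Returning NULL for the intervals when Stage 1 does not reject is one possible design choice; as noted above, some users instead fall back to Bonferroni intervals in that case.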


Example 9.4.1 (Fast Cars). Kitchens (1997) discusses an experiment concern-
ing the speed of cars. Five cars are considered: Acura (1), Ferrari (2), Lotus (3),
Porsche (4), and Viper (5). For each car, 6 runs were made, 3 in each direction. For
each run, the speed recorded is the maximum speed on the run achieved without
exceeding the engine’s redline. The data are in the file fastcars.rda. Figure 9.4.1
displays the comparison boxplots of the speeds versus the cars, which shows clearly
that there are differences in speed due to the car. Ferrari and Porsche seem to be
the fastest but are the differences significant? We assume the one-way design (9.2.1)
and use R to do the computations. Key commands and corresponding results are
given next. The overall F-test of the hypotheses of equal means, (9.2.2), is quite
significant: F = 25.15 with the p-value 0.0000. We selected the Tukey MCP at level
0.05. The command below returns all $\binom{5}{2} = 10$ pairwise comparisons, but in our
summary we only list two.
### Code assumes that fastcars.rda has been loaded in R


3 Downloadable at the site listed in the Preface.
> fit <- lm(speed ~ factor(car))

> anova(fit)

### F-Stat and p-value 25.145 1.903e-08
> aovfit <- aov(speed ~ factor(car))

> TukeyHSD(aovfit)


## Tukey's procedures of all pairwise comparisons are computed.
## Summary of a pertinent few


## Cars Mean-diff LB CI UB CI Sig??
## Porsche - Ferrari -2.6166667 -9.0690855 3.835752 NS
## Viper - Porsche -7.7333333 -14.1857522 -1.280914 Sig.


## Bonferroni

> mcpbon(speed, car)

## Porsche - Ferrari -2.6166667 -9.3795891 4.1462558 NS
## Viper - Porsche -7.7333333 -14.496255 -0.9704109 Sig.
2.197038 6.762922 0.9704109 14.49625578


## Fisher

> mcpfisher(speed, car)

## ftest 2.514542e+01 1.903360e-08

## Porsche - Ferrari -2.6166667 -7.141552 1.908219 NS

## Viper - Porsche -7.7333333  -12.258219 -3.208448 Sig.

For discussion, we cite only two of Tukey’s confidence intervals. As the second in-
terval in the above printout shows, the mean speeds of both the Ferrari and Porsche
are significantly faster than the mean speeds of the other cars. The difference be-
tween the Ferrari’s and Porsche’s mean speeds, though, is not significant. Below the
two Tukey confidence intervals, we display the results based on the Bonferroni and
Fisher procedures. Note that all three procedures result in the same conclusions for
these comparisons. The Bonferroni intervals are slightly wider than those of the
Tukey procedure. The Fisher procedure gives the shortest intervals, as expected.


In practice, the Tukey-Kramer procedure is often used, but there are many other
multiple comparison procedures. A classical monograph on MCPs is Miller (1981)
while Hsu (1996) offers a more recent discussion.


EXERCISES 


9.4.1. For the study discussed in Exercise 9.2.8, obtain the results of the Bonferroni
multiple comparison procedure using $\alpha = 0.10$. Based on this procedure, which
brand of fuel, if any, is significantly best?


9.4.2. For the study discussed in Exercise 9.2.6, compute the Tukey-Kramer pro- 
cedure. Are there any significant differences? 


[Figure 9.4.1: Boxplot of car speeds cited in Example 9.4.1. Vertical axis: Speed; categories: Acura, Ferrari, Lotus, Porsche, Viper.]

9.4.3. Suppose $X$ and $Y$ are discrete random variables that have the common
range $\{1, 2, \ldots, k\}$. Let $p_{1j}$ and $p_{2j}$ be the respective probabilities $P(X = j)$ and
$P(Y = j)$. Let $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$ be respective independent random
samples on $X$ and $Y$. The samples are recorded in a $2 \times k$ contingency table of
counts $O_{ij}$, where $O_{1j} = \#\{X_i = j\}$ and $O_{2j} = \#\{Y_i = j\}$. In Example 4.7.3,
based on this table, we discussed a test that the distributions of $X$ and $Y$ are the
same. Here we want to consider all the differences $p_{1j} - p_{2j}$ for $j = 1, \ldots, k$. Let
$\hat{p}_{ij} = O_{ij}/n_i$.

(a) Determine the Bonferroni method for performing all these comparisons. 

(b) Determine the Fisher method for performing all these comparisons. 


9.4.4. Suppose the samples in Exercise 9.4.3 resulted in the contingency table: 


Pp fit 2|3}4) 5] 6] 7] 81] 9 | 10) 


To compute (in R) the confidence intervals below, use the command prop.test as 
in Example 4.2.5. 


(a) Based on the Bonferroni procedure for all 10 comparisons, compute the
confidence interval for $p_{16} - p_{26}$.


(b) Based on the Fisher procedure for all 10 comparisons, compute the confidence
interval for $p_{16} - p_{26}$.


9.4.5. Write an R function that computes the Fisher procedure of Exercise 9.4.3. 
Validate it using the data of Exercise 9.4.4. 
9.4.6. Extend the Bonferroni procedure to simultaneous testing. That is, suppose
we have $m$ hypotheses of interest: $H_{0j}$ versus $H_{1j}$, $j = 1, \ldots, m$. For testing $H_{0j}$
versus $H_{1j}$, let $C_{j,\alpha}$ be a critical region of size $\alpha$ and assume $H_{0j}$ is rejected if
$X_j \in C_{j,\alpha}$, for a sample $X_j$. Determine a rule so that we can simultaneously test
these $m$ hypotheses with a Type I error rate less than or equal to $\alpha$.


9.5 Two-Way ANOVA 


Recall the one-way analysis of variance (ANOVA) problem considered in Section
9.2, which was concerned with one factor at $b$ levels. In this section, we are
concerned with the situation where we have two factors $A$ and $B$ with levels $a$ and
$b$, respectively. This is called a two-way analysis of variance (ANOVA). Let
$X_{ij}$, $i = 1, 2, \ldots, a$ and $j = 1, 2, \ldots, b$, denote the response for factor $A$ at level
$i$ and factor $B$ at level $j$. Denote the total sample size by $n = ab$. We shall assume
that the $X_{ij}$s are independent normally distributed random variables with common
variance $\sigma^2$. Denote the mean of $X_{ij}$ by $\mu_{ij}$. The mean $\mu_{ij}$ is often referred to as
the mean of the $(i, j)$th cell. For our first model, we consider the additive model
where

$$\mu_{ij} = \bar{\mu} + (\bar{\mu}_{i\cdot} - \bar{\mu}) + (\bar{\mu}_{\cdot j} - \bar{\mu}); \tag{9.5.1}$$

that is, the mean in the $(i, j)$th cell is due to additive effects of the levels, $i$ of factor
$A$ and $j$ of factor $B$, over the average (constant) $\bar{\mu}$. Let $\alpha_i = \bar{\mu}_{i\cdot} - \bar{\mu}$, $i = 1, \ldots, a$;
$\beta_j = \bar{\mu}_{\cdot j} - \bar{\mu}$, $j = 1, \ldots, b$; and $\mu = \bar{\mu}$. Then the model can be written more simply
as

$$\mu_{ij} = \mu + \alpha_i + \beta_j, \tag{9.5.2}$$

where $\sum_{i=1}^{a} \alpha_i = 0$ and $\sum_{j=1}^{b} \beta_j = 0$. We refer to this model as being a two-way
additive ANOVA model.

For example, take $a = 2$, $b = 3$, $\mu = 5$, $\alpha_1 = 1$, $\alpha_2 = -1$, $\beta_1 = 1$, $\beta_2 = 0$, and
$\beta_3 = -1$. Then the cell means are

                          Factor B
                  1               2               3
Factor A   1   $\mu_{11} = 7$   $\mu_{12} = 6$   $\mu_{13} = 5$
           2   $\mu_{21} = 5$   $\mu_{22} = 4$   $\mu_{23} = 3$

Note that for each $i$, the plots of $\mu_{ij}$ versus $j$ are parallel. This is true for additive
models in general; see Exercise 9.5.9. We call these plots mean profile plots.
Had we taken $\beta_1 = \beta_2 = \beta_3 = 0$, then the cell means would be

                          Factor B
                  1               2               3
Factor A   1   $\mu_{11} = 6$   $\mu_{12} = 6$   $\mu_{13} = 6$
           2   $\mu_{21} = 4$   $\mu_{22} = 4$   $\mu_{23} = 4$
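As a quick check on the first additive example, the cell means can be generated in R and the parallel-profile property verified numerically. This is only an illustrative sketch; the object names (`cell`, `alpha`, `beta`) are ours and do not come from the text's package.

```r
# Parameters of the first additive example in the text
mu    <- 5
alpha <- c(1, -1)      # factor A effects (sum to zero)
beta  <- c(1, 0, -1)   # factor B effects (sum to zero)

# Cell means mu_ij = mu + alpha_i + beta_j
cell <- mu + outer(alpha, beta, "+")
dimnames(cell) <- list(A = 1:2, B = 1:3)
cell

# Parallel profiles: the gap between the two rows is the same in every column
cell[1, ] - cell[2, ]

# Mean profile plot: one profile per level of factor A
matplot(t(cell), type = "b", xlab = "Level of factor B", ylab = "Cell mean")
```

The constant row difference (here $\alpha_1 - \alpha_2 = 2$) is exactly what makes the profiles parallel.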


The hypotheses of interest are 


$$H_{0A}: \alpha_1 = \cdots = \alpha_a = 0 \ \text{ versus } \ H_{1A}: \alpha_i \neq 0, \text{ for some } i, \tag{9.5.3}$$

and

$$H_{0B}: \beta_1 = \cdots = \beta_b = 0 \ \text{ versus } \ H_{1B}: \beta_j \neq 0, \text{ for some } j. \tag{9.5.4}$$

If $H_{0A}$ is true, then by (9.5.2) the mean of the $(i, j)$th cell does not depend on the
level of $A$. The second example above is under $H_{0B}$. The cell means remain the
same from column to column for a specified row. We call these hypotheses main
effect hypotheses.


Remark 9.5.1. The model just described, and others similar to it, are widely used
in statistical applications. Consider a situation in which it is desirable to investigate
the effects of two factors that influence an outcome. Thus the variety of a grain
and the type of fertilizer used influence the yield; or the teacher and the size of the
class may influence the score on a standardized test. Let $X_{ij}$ denote the yield from
the use of variety $i$ of a grain and type $j$ of fertilizer. A test of the hypothesis that
$\beta_1 = \beta_2 = \cdots = \beta_b = 0$ would then be a test of the hypothesis that the mean yield
of each variety of grain is the same regardless of the type of fertilizer used. ■


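Call the model described around expression (9.5.2) the full model. We want to
determine the mles. If we write out the likelihood function, the summation in the
exponent of $e$ is

$$SS = \sum_{i=1}^{a}\sum_{j=1}^{b} (x_{ij} - \mu - \alpha_i - \beta_j)^2.$$

The mles of $\alpha_i$, $\beta_j$, and $\mu$ minimize $SS$. By adding in and subtracting out, we
obtain:

$$SS = \sum_{i=1}^{a}\sum_{j=1}^{b} \left\{[\bar{x}_{\cdot\cdot} - \mu] - [\alpha_i - (\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot})] - [\beta_j - (\bar{x}_{\cdot j} - \bar{x}_{\cdot\cdot})] + [x_{ij} - \bar{x}_{i\cdot} - \bar{x}_{\cdot j} + \bar{x}_{\cdot\cdot}]\right\}^2. \tag{9.5.5}$$

From expression (9.5.2), we have $\sum_i \alpha_i = \sum_j \beta_j = 0$. Further,

$$\sum_{i=1}^{a} (\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot}) = \sum_{j=1}^{b} (\bar{x}_{\cdot j} - \bar{x}_{\cdot\cdot}) = 0$$

and

$$\sum_{i=1}^{a} (x_{ij} - \bar{x}_{i\cdot} - \bar{x}_{\cdot j} + \bar{x}_{\cdot\cdot}) = \sum_{j=1}^{b} (x_{ij} - \bar{x}_{i\cdot} - \bar{x}_{\cdot j} + \bar{x}_{\cdot\cdot}) = 0.$$

Therefore, in the expansion of the sum of squares, (9.5.5), all cross product terms
are 0. Hence, we have the identity

$$SS = ab[\bar{x}_{\cdot\cdot} - \mu]^2 + b\sum_{i=1}^{a} [\alpha_i - (\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot})]^2 + a\sum_{j=1}^{b} [\beta_j - (\bar{x}_{\cdot j} - \bar{x}_{\cdot\cdot})]^2 + \sum_{i=1}^{a}\sum_{j=1}^{b} [x_{ij} - \bar{x}_{i\cdot} - \bar{x}_{\cdot j} + \bar{x}_{\cdot\cdot}]^2. \tag{9.5.6}$$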
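Since these are sums of squares, the minimizing values (mles) must be

$$\hat{\mu} = \bar{X}_{\cdot\cdot}, \quad \hat{\alpha}_i = \bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot}, \quad \text{and} \quad \hat{\beta}_j = \bar{X}_{\cdot j} - \bar{X}_{\cdot\cdot}. \tag{9.5.7}$$

Note that we have used random variable notation. So these are the maximum
likelihood estimators. It then follows that the maximum likelihood estimator of $\sigma^2$
is

$$\hat{\sigma}^2_{\Omega} = \frac{\sum_{i=1}^{a}\sum_{j=1}^{b} [X_{ij} - \bar{X}_{i\cdot} - \bar{X}_{\cdot j} + \bar{X}_{\cdot\cdot}]^2}{ab} = \frac{Q_3'}{ab}, \tag{9.5.8}$$

where we have defined the numerator of $\hat{\sigma}^2_{\Omega}$ as the quadratic form $Q_3'$. It follows
from an advanced course in linear models that $ab\,\hat{\sigma}^2_{\Omega}/\sigma^2$ has a $\chi^2((a - 1)(b - 1))$
distribution.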

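Next we construct the likelihood ratio test for $H_{0B}$. Under the reduced model
(full model constrained by $H_{0B}$), $\beta_j = 0$ for all $j = 1, \ldots, b$. To obtain the mles for
the reduced model, the identity (9.5.6) becomes

$$SS = ab[\bar{x}_{\cdot\cdot} - \mu]^2 + b\sum_{i=1}^{a} [\alpha_i - (\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot})]^2 + a\sum_{j=1}^{b} [\bar{x}_{\cdot j} - \bar{x}_{\cdot\cdot}]^2 + \sum_{i=1}^{a}\sum_{j=1}^{b} [x_{ij} - \bar{x}_{i\cdot} - \bar{x}_{\cdot j} + \bar{x}_{\cdot\cdot}]^2. \tag{9.5.9}$$

Thus the mles for $\alpha_i$ and $\mu$ remain the same as in the full model and the reduced
model maximum likelihood estimator of $\sigma^2$ is

$$\hat{\sigma}^2_{\omega} = \frac{a\sum_{j=1}^{b} [\bar{X}_{\cdot j} - \bar{X}_{\cdot\cdot}]^2 + \sum_{i=1}^{a}\sum_{j=1}^{b} [X_{ij} - \bar{X}_{i\cdot} - \bar{X}_{\cdot j} + \bar{X}_{\cdot\cdot}]^2}{ab}. \tag{9.5.10}$$

Denote the numerator of $\hat{\sigma}^2_{\omega}$ by $Q'$. Note that it is the residual variation left after
fitting the reduced model.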

Let $\Lambda$ denote the likelihood ratio test statistic for $H_{0B}$. Our derivation is similar
to the derivation for the likelihood ratio test statistic for one-way ANOVA of Section
9.2. Hence, similar to equation (9.2.9), our likelihood ratio test statistic simplifies
to

$$\Lambda^{2/(ab)} = \frac{\hat{\sigma}^2_{\Omega}}{\hat{\sigma}^2_{\omega}} = \frac{Q_3'}{Q'}.$$

Then, similar to the one-way derivation, the likelihood ratio test rejects $H_{0B}$ for
large values of $Q_4'/Q_3'$, where in this case,

$$Q_4' = a\sum_{j=1}^{b} [\bar{X}_{\cdot j} - \bar{X}_{\cdot\cdot}]^2. \tag{9.5.11}$$

Note that $Q_4' = Q' - Q_3'$; i.e., it is the incremental increase in residual variation if
we use the reduced model instead of the full model.

To obtain the null distribution of $Q_4'$, notice that it is the numerator of the sample
variance of the random variables $\sqrt{a}\,\bar{X}_{\cdot 1}, \ldots, \sqrt{a}\,\bar{X}_{\cdot b}$. These random variables are
independent with the common $N(\sqrt{a}\,\mu, \sigma^2)$ distribution; see Exercise 9.5.2. Hence,
by Theorem 3.6.1, $Q_4'/\sigma^2$ has a $\chi^2(b - 1)$ distribution. In a more advanced course, it
can be further shown that $Q_4'$ and $Q_3'$ are independent. Hence, the statistic

$$F_B = \frac{a\sum_{j=1}^{b} [\bar{X}_{\cdot j} - \bar{X}_{\cdot\cdot}]^2/(b - 1)}{\sum_{i=1}^{a}\sum_{j=1}^{b} [X_{ij} - \bar{X}_{i\cdot} - \bar{X}_{\cdot j} + \bar{X}_{\cdot\cdot}]^2/[(a - 1)(b - 1)]} \tag{9.5.12}$$

has an $F(b - 1, (a - 1)(b - 1))$ distribution under $H_{0B}$. Thus, a level $\alpha$ test is to reject $H_{0B}$ in
favor of $H_{1B}$ if

$$F_B \geq F(\alpha, b - 1, (a - 1)(b - 1)). \tag{9.5.13}$$


If we are to compute the power function of the test, we need the distribution of
$F_B$ when $H_{0B}$ is not true. As we have stated above, $Q_3'/\sigma^2$, (9.5.8), has a central
$\chi^2$-distribution with $(a - 1)(b - 1)$ degrees of freedom under the full model, and,
hence, under $H_{1B}$. Further, it can be shown that $Q_4'/\sigma^2$, with $Q_4'$ given in (9.5.11), has a noncentral
$\chi^2$-distribution with $b - 1$ degrees of freedom under $H_{1B}$. To compute the noncentrality
parameter of $Q_4'/\sigma^2$ when $H_{1B}$ is true, we have $E(X_{ij}) = \mu + \alpha_i + \beta_j$, $E(\bar{X}_{i\cdot}) =
\mu + \alpha_i$, $E(\bar{X}_{\cdot j}) = \mu + \beta_j$, and $E(\bar{X}_{\cdot\cdot}) = \mu$. Using the general rule discussed in
Section 9.4, we replace the variables in $Q_4'/\sigma^2$ with their means. Accordingly, the
noncentrality parameter of $Q_4'/\sigma^2$ is

$$\frac{a}{\sigma^2}\sum_{j=1}^{b} (\mu + \beta_j - \mu)^2 = \frac{a}{\sigma^2}\sum_{j=1}^{b} \beta_j^2.$$

Thus, if the hypothesis $H_{0B}$ is not true, $F_B$ has a noncentral $F$-distribution with $b - 1$
and $(a - 1)(b - 1)$ degrees of freedom and noncentrality parameter $a\sum_{j=1}^{b} \beta_j^2/\sigma^2$.
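The power of the test based on $F_B$ can be sketched in R with the noncentral $F$ distribution function `pf(..., ncp = )`. The design below ($a = 2$, $b = 3$, the column effects, and $\sigma^2 = 1$) is hypothetical and chosen only for illustration; the object names are ours.

```r
# Hypothetical design: all numerical values are assumptions for illustration
a <- 2; b <- 3
beta   <- c(1, 0, -1)   # column effects under H_1B (sum to zero)
sigma2 <- 1             # assumed error variance
alpha  <- 0.05          # level of the test

df1 <- b - 1
df2 <- (a - 1) * (b - 1)
ncp <- a * sum(beta^2) / sigma2          # noncentrality parameter a * sum(beta_j^2) / sigma^2

crit  <- qf(1 - alpha, df1, df2)         # upper-alpha critical value F(alpha, df1, df2)
power <- 1 - pf(crit, df1, df2, ncp = ncp)
c(ncp = ncp, critical.value = crit, power = power)
```

With only one observation per cell and such a small layout, the error degrees of freedom are few, so the power is modest; it increases with $a$ and with $\sum_j \beta_j^2/\sigma^2$.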

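A similar argument can be used to construct the likelihood ratio test statistic
$F_A$ to test $H_{0A}$ versus $H_{1A}$, (9.5.3). The numerator of the $F$ test statistic is the
sum of squares among rows. The test statistic is

$$F_A = \frac{b\sum_{i=1}^{a} [\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot}]^2/(a - 1)}{\sum_{i=1}^{a}\sum_{j=1}^{b} [X_{ij} - \bar{X}_{i\cdot} - \bar{X}_{\cdot j} + \bar{X}_{\cdot\cdot}]^2/[(a - 1)(b - 1)]} \tag{9.5.14}$$

and it has an $F(a - 1, (a - 1)(b - 1))$ distribution under $H_{0A}$.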
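The statistics (9.5.12) and (9.5.14) are straightforward to compute directly. The R sketch below does so for a small made-up $3 \times 4$ layout (the numbers are hypothetical, not from the text) and cross-checks the results against `anova()` applied to the additive fit. The names `x`, `rowf`, and `colf` are ours.

```r
# Hypothetical 3 x 4 layout, one observation per cell (made-up numbers)
x <- matrix(c(10, 12, 15, 11,
               9, 13, 14, 12,
              14, 15, 19, 16), nrow = 3, byrow = TRUE)
a <- nrow(x); b <- ncol(x)

xbar  <- mean(x)            # grand mean
rmean <- rowMeans(x)        # row means
cmean <- colMeans(x)        # column means

# Residuals x_ij - xbar_i. - xbar_.j + xbar_.. and their sum of squares
res <- x - outer(rmean, cmean, "+") + xbar
Q3  <- sum(res^2)

FA <- (b * sum((rmean - xbar)^2) / (a - 1)) / (Q3 / ((a - 1) * (b - 1)))
FB <- (a * sum((cmean - xbar)^2) / (b - 1)) / (Q3 / ((a - 1) * (b - 1)))

# Cross-check with the additive linear model
y    <- as.vector(x)
rowf <- factor(as.vector(row(x)))
colf <- factor(as.vector(col(x)))
tab  <- anova(lm(y ~ rowf + colf))
rbind(direct = c(FA = FA, FB = FB), anova = tab$`F value`[1:2])
```

Because the layout is balanced, the sequential sums of squares reported by `anova()` coincide with the row and column sums of squares used in the formulas.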


9.5.1 Interaction between Factors 


The analysis of variance problem that has just been discussed is usually referred 
to as a two-way classification with one observation per cell. Each combination of i 
and j determines a cell; thus, there is a total of ab cells in this model. Let us now 
investigate another two-way classification problem, but in this case we take c > 1 
independent observations per cell. 

Let $X_{ijk}$, $i = 1, 2, \ldots, a$, $j = 1, 2, \ldots, b$, and $k = 1, 2, \ldots, c$, denote $n = abc$
random variables that are independent and have normal distributions with common,
but unknown, variance $\sigma^2$. Denote the mean of each $X_{ijk}$, $k = 1, 2, \ldots, c$, by $\mu_{ij}$.
Under the additive model, (9.5.1), the mean of each cell depended on its row and
column, but often the mean is cell-specific. To allow this, consider the parameters

$$\gamma_{ij} = \mu_{ij} - \{\mu + (\bar{\mu}_{i\cdot} - \mu) + (\bar{\mu}_{\cdot j} - \mu)\} = \mu_{ij} - \bar{\mu}_{i\cdot} - \bar{\mu}_{\cdot j} + \mu,$$

for $i = 1, \ldots, a$, $j = 1, \ldots, b$. Hence $\gamma_{ij}$ reflects the specific contribution to the cell
mean over and above the additive model. These parameters are called interaction
parameters. Using the second form (9.5.2), we can write the cell means as

$$\mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}, \tag{9.5.15}$$

where $\sum_{i=1}^{a} \alpha_i = 0$, $\sum_{j=1}^{b} \beta_j = 0$, and $\sum_{i=1}^{a} \gamma_{ij} = \sum_{j=1}^{b} \gamma_{ij} = 0$. This model is
called a two-way model with interaction.

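For example, take $a = 2$, $b = 3$, $\mu = 5$, $\alpha_1 = 1$, $\alpha_2 = -1$, $\beta_1 = 1$, $\beta_2 = 0$,
$\beta_3 = -1$, $\gamma_{11} = 1$, $\gamma_{12} = 1$, $\gamma_{13} = -2$, $\gamma_{21} = -1$, $\gamma_{22} = -1$, and $\gamma_{23} = 2$. Then the
cell means are

                          Factor B
                  1               2               3
Factor A   1   $\mu_{11} = 8$   $\mu_{12} = 7$   $\mu_{13} = 3$
           2   $\mu_{21} = 4$   $\mu_{22} = 3$   $\mu_{23} = 5$

If each $\gamma_{ij} = 0$, then the cell means are

                          Factor B
                  1               2               3
Factor A   1   $\mu_{11} = 7$   $\mu_{12} = 6$   $\mu_{13} = 5$
           2   $\mu_{21} = 5$   $\mu_{22} = 4$   $\mu_{23} = 3$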
Note that the mean profile plots for this second example are parallel, but those in 
the first example (where interaction is present) are not. 
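Going in the other direction, the parameters $\mu$, $\alpha_i$, $\beta_j$, and $\gamma_{ij}$ can be recovered from a table of cell means by averaging. The following R sketch does this for the first table of the interaction example; the object names (`M`, `alpha`, `beta`, `gam`) are ours.

```r
# Cell means of the interaction example in the text
M <- rbind(c(8, 7, 3),
           c(4, 3, 5))

mu    <- mean(M)                            # overall mean
alpha <- rowMeans(M) - mu                   # factor A effects
beta  <- colMeans(M) - mu                   # factor B effects
gam   <- M - outer(alpha, beta, "+") - mu   # interaction parameters gamma_ij

mu; alpha; beta; gam
rowSums(gam); colSums(gam)                  # all zero, as the constraints require
```

Nonzero entries in `gam` are what make the mean profiles of this example nonparallel.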

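The derivation of the mles under the full model, (9.5.15), is quite similar to the
derivation for the additive model. Letting $SS$ denote the sums of squares in the
exponent of $e$ in the likelihood function, we obtain the following identity by adding
in and subtracting out (we have omitted subscripts on the sums):

$$\begin{aligned}
SS &= \sum\sum\sum (x_{ijk} - \mu - \alpha_i - \beta_j - \gamma_{ij})^2 \\
&= \sum\sum\sum \{[x_{ijk} - \bar{x}_{ij\cdot}] - [\mu - \bar{x}_{\cdot\cdot\cdot}] - [\alpha_i - (\bar{x}_{i\cdot\cdot} - \bar{x}_{\cdot\cdot\cdot})] - [\beta_j - (\bar{x}_{\cdot j\cdot} - \bar{x}_{\cdot\cdot\cdot})] \\
&\qquad\qquad - [\gamma_{ij} - (\bar{x}_{ij\cdot} - \bar{x}_{i\cdot\cdot} - \bar{x}_{\cdot j\cdot} + \bar{x}_{\cdot\cdot\cdot})]\}^2 \\
&= \sum\sum\sum [x_{ijk} - \bar{x}_{ij\cdot}]^2 + abc[\mu - \bar{x}_{\cdot\cdot\cdot}]^2 + bc\sum [\alpha_i - (\bar{x}_{i\cdot\cdot} - \bar{x}_{\cdot\cdot\cdot})]^2 \\
&\qquad + ac\sum [\beta_j - (\bar{x}_{\cdot j\cdot} - \bar{x}_{\cdot\cdot\cdot})]^2 + c\sum\sum [\gamma_{ij} - (\bar{x}_{ij\cdot} - \bar{x}_{i\cdot\cdot} - \bar{x}_{\cdot j\cdot} + \bar{x}_{\cdot\cdot\cdot})]^2,
\end{aligned} \tag{9.5.16}$$

where, as in the additive model, the cross product terms in the expansion are 0.
Thus, the mles of $\mu$, $\alpha_i$, and $\beta_j$ are the same as in the additive model; the mle of
$\gamma_{ij}$ is $\hat{\gamma}_{ij} = \bar{X}_{ij\cdot} - \bar{X}_{i\cdot\cdot} - \bar{X}_{\cdot j\cdot} + \bar{X}_{\cdot\cdot\cdot}$; and the mle of $\sigma^2$ is

$$\hat{\sigma}^2_{\Omega} = \frac{\sum\sum\sum [X_{ijk} - \bar{X}_{ij\cdot}]^2}{abc}. \tag{9.5.17}$$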
Let $Q_3''$ denote the numerator of $\hat{\sigma}^2_{\Omega}$.
The major hypotheses of interest for the interaction model are

$$H_{0AB}: \gamma_{ij} = 0 \text{ for all } i, j \ \text{ versus } \ H_{1AB}: \gamma_{ij} \neq 0, \text{ for some } i, j. \tag{9.5.18}$$

Substituting $\gamma_{ij} = 0$ in $SS$, it is clear that the reduced model mle of $\sigma^2$ is

$$\hat{\sigma}^2_{\omega} = \frac{\sum\sum\sum [X_{ijk} - \bar{X}_{ij\cdot}]^2 + c\sum\sum [\bar{X}_{ij\cdot} - \bar{X}_{i\cdot\cdot} - \bar{X}_{\cdot j\cdot} + \bar{X}_{\cdot\cdot\cdot}]^2}{abc}. \tag{9.5.19}$$

Let $Q''$ denote the numerator of $\hat{\sigma}^2_{\omega}$, and let $Q_4'' = Q'' - Q_3''$. Then it follows as in the
additive model that the likelihood ratio test statistic rejects $H_{0AB}$ for large values
of $Q_4''/Q_3''$. In a more advanced class, it is shown that the standardized test statistic

$$F_{AB} = \frac{Q_4''/[(a - 1)(b - 1)]}{Q_3''/[ab(c - 1)]} \tag{9.5.20}$$

has under $H_{0AB}$ an $F$-distribution with $(a - 1)(b - 1)$ and $ab(c - 1)$ degrees of
freedom.

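If $H_{0AB}: \gamma_{ij} = 0$ is accepted, then one usually continues to test $\alpha_i = 0$, $i =
1, 2, \ldots, a$, by using the test statistic

$$F = \frac{bc\sum_{i=1}^{a} (\bar{X}_{i\cdot\cdot} - \bar{X}_{\cdot\cdot\cdot})^2/(a - 1)}{\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{c} (X_{ijk} - \bar{X}_{ij\cdot})^2/[ab(c - 1)]},$$

which has a null $F$-distribution with $a - 1$ and $ab(c - 1)$ degrees of freedom. Similarly,
the test of $\beta_j = 0$, $j = 1, 2, \ldots, b$, proceeds by using the test statistic

$$F = \frac{ac\sum_{j=1}^{b} (\bar{X}_{\cdot j\cdot} - \bar{X}_{\cdot\cdot\cdot})^2/(b - 1)}{\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{c} (X_{ijk} - \bar{X}_{ij\cdot})^2/[ab(c - 1)]},$$

which has a null $F$-distribution with $b - 1$ and $ab(c - 1)$ degrees of freedom.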
We conclude this section with an example that serves as an illustration of two- 
way ANOVA along with its associated R code. 


Example 9.5.1. Devore (2012), page 435, presents a study concerning the effects 
to the thermal conductivity of an asphalt mix due to two factors: Binder Grade 
at three different levels (PG58, PG64, and PG70) and Coarseness of Aggregate 
Content at three levels (38%, 41%, and 44%). Hence, there are 3 x 3 = 9 different 
treatments. The responses are the thermal conductivities of the mixes of asphalt at 
these crossed levels. Two replications were performed at each treatment. The data 
are: 
Binder-Grade |  38%  |  41%  |  44%
PG58         | 0.835 | 0.822 | 0.785
             | 0.845 | 0.826 | 0.795
PG64         | 0.855 | 0.832 | 0.790
             | 0.865 | 0.836 | 0.800
PG70         | 0.815 | 0.800 | 0.770
             | 0.825 | 0.820 | 0.790

The data are also in the file conductivity.rda. Assuming this file has been loaded 
into the R work area, the mean profile plot is computed by 

interaction.plot(Binder, Aggregate, Conductivity, legend=T)
and it is displayed in Figure 9.5.1. Note that the mean profiles are almost parallel, 
a graphical indication of little interaction between the factors. The ANOVA for 
the study is computed by the following two commands. It yields the tabled results 
(which we have abbreviated). The next to last column shows the F-test statistics 
discussed in this section. 

fit=lm(Conductivity ~ factor(Binder) + factor(Aggregate) +
       factor(Binder)*factor(Aggregate))
anova(fit)

Analysis of Variance Table
                                  Df    Sum Sq  F value    Pr(>F)
factor(Binder)                     2 0.0020893  14.1171  0.001678
factor(Aggregate)                  2 0.0082973  56.0631 8.308e-06
factor(Binder):factor(Aggregate)   4 0.0003253   1.0991  0.413558
As the interaction plot suggests, interaction is not significant ($p = 0.4136$). In
practice, we would accept the additive (no interaction) model. The main effects are
both highly significant. So both factors have an effect on conductivity. See Devore
(2012) for more discussion. ■
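For readers who do not have conductivity.rda at hand, the analysis can be reproduced by typing the data from the table above. The sketch below is one way to do so; the vector names follow the text, but the way the factors are built with `rep()` and the labels of the Aggregate levels are our own choices.

```r
# Data of Example 9.5.1 entered by hand, binder by binder;
# within each binder: 38% (2 replicates), 41% (2), 44% (2)
Conductivity <- c(0.835, 0.845, 0.822, 0.826, 0.785, 0.795,   # PG58
                  0.855, 0.865, 0.832, 0.836, 0.790, 0.800,   # PG64
                  0.815, 0.825, 0.800, 0.820, 0.770, 0.790)   # PG70
Binder    <- factor(rep(c("PG58", "PG64", "PG70"), each = 6))
Aggregate <- factor(rep(rep(c("38", "41", "44"), each = 2), times = 3))

# Two-way model with interaction; Binder * Aggregate expands to
# Binder + Aggregate + Binder:Aggregate
fit <- lm(Conductivity ~ Binder * Aggregate)
tab <- anova(fit)
tab
```

The $F$ values in `tab` should agree with those in the abbreviated table of the example, up to rounding.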


EXERCISES 


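9.5.1. For the two-way interaction model, (9.5.15), show that the following
decomposition of sums of squares is true:

$$\begin{aligned}
\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{c} (X_{ijk} - \bar{X}_{\cdot\cdot\cdot})^2 &= bc\sum_{i=1}^{a} (\bar{X}_{i\cdot\cdot} - \bar{X}_{\cdot\cdot\cdot})^2 + ac\sum_{j=1}^{b} (\bar{X}_{\cdot j\cdot} - \bar{X}_{\cdot\cdot\cdot})^2 \\
&\quad + c\sum_{i=1}^{a}\sum_{j=1}^{b} (\bar{X}_{ij\cdot} - \bar{X}_{i\cdot\cdot} - \bar{X}_{\cdot j\cdot} + \bar{X}_{\cdot\cdot\cdot})^2 + \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{c} (X_{ijk} - \bar{X}_{ij\cdot})^2;
\end{aligned}$$

that is, the total sum of squares is decomposed into that due to row differences,
that due to column differences, that due to interaction, and that within cells.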
[Figure 9.5.1: Mean profile plot for the study discussed in Example 9.5.1 (vertical axis: mean of Conductivity; horizontal axis: Binder, with levels PG58, PG64, PG70). The
profiles are nearly parallel, indicating little interaction between the factors.]


9.5.2. Consider the discussion above expression (9.5.14). Show that the random
variables $\sqrt{a}\,\bar{X}_{\cdot 1}, \ldots, \sqrt{a}\,\bar{X}_{\cdot b}$ are independent with the common $N(\sqrt{a}\,\mu, \sigma^2)$
distribution.


9.5.3. For the two-way interaction model, (9.5.15), show that the noncentrality
parameter of the test statistic $F_{AB}$ is equal to $c\sum_{j=1}^{b}\sum_{i=1}^{a} \gamma_{ij}^2/\sigma^2$.

9.5.4. Using the background of the two-way classification with one observation per
cell, determine the distribution of the maximum likelihood estimators of $\alpha_i$, $\beta_j$, and
$\mu$.

9.5.5. Prove that the linear functions $X_{ij} - \bar{X}_{i\cdot} - \bar{X}_{\cdot j} + \bar{X}_{\cdot\cdot}$ and $\bar{X}_{\cdot j} - \bar{X}_{\cdot\cdot}$ are
uncorrelated, under the assumptions of this section.


9.5.6. Given the following observations associated with a two-way classification
with $a = 3$ and $b = 4$, use R or another statistical package to compute the
$F$-statistics used to test the equality of the column means ($\beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$)
and the equality of the row means ($\alpha_1 = \alpha_2 = \alpha_3 = 0$), respectively.


Row/Column     1      2      3      4

    1         3.1    4.2    2.7    4.9
    2         2.7    2.9    1.8    3.0
    3         4.0    4.6    3.0    3.9


9.5.7. With the background of the two-way classification with $c > 1$ observations
per cell, determine the distribution of the mles of $\alpha_i$, $\beta_j$, and $\gamma_{ij}$.

9.5.8. Given the following observations in a two-way classification with $a = 3$,
$b = 4$, and $c = 2$, compute the $F$-statistics used to test that all interactions are
equal to zero ($\gamma_{ij} = 0$), all column means are equal ($\beta_j = 0$), and all row means
are equal ($\alpha_i = 0$), respectively. Data are in the form $x_{ijk}, i, j$ in the data set
sec951.rda.


Row/Column     1      2      3      4

    1         3.1    4.2    2.7    4.9
              2.9    4.9    3.2    4.5
    2         2.7    2.9    1.8    3.0
              2.9    2.3    2.4    3.7
    3         4.0    4.6    3.0    3.9
              4.4    5.0    2.5    4.2


9.5.9. For the additive model (9.5.1), show that the mean profile plots are parallel.
The sample mean profile plots are given by plotting $\bar{X}_{ij\cdot}$ versus $j$, for each $i$. These
offer a graphical diagnostic for interaction detection. Obtain these plots for the last
exercise.


9.5.10. We wish to compare compressive strengths of concrete corresponding to 
a = 3 different drying methods (treatments). Concrete is mixed in batches that 
are just large enough to produce three cylinders. Although care is taken to achieve 
uniformity, we expect some variability among the b = 5 batches used to obtain the 
following compressive strengths. (There is little reason to suspect interaction, and 
hence only one observation is taken in each cell.) Data are also in the data set 
sec95set2.rda. 


                      Batch
Treatment    B1    B2    B3    B4    B5
   A1        52    47    44    51    42
   A2        60    55    49    52    43
   A3        56    48    45    44    38


(a) Use the 5% significance level and test $H_A: \alpha_1 = \alpha_2 = \alpha_3 = 0$ against all
alternatives.


(b) Use the 5% significance level and test $H_B: \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0$
against all alternatives.


9.5.11. With $a = 3$ and $b = 4$, find $\mu$, $\alpha_i$, $\beta_j$, and $\gamma_{ij}$ if $\mu_{ij}$, for $i = 1, 2, 3$ and
$j = 1, 2, 3, 4$, are given by


6 7 7 12 
10 3 11 8 
8 5 9 10 


9.6 A Regression Problem 


There is often interest in the relationship between two variables, for example, a
student's scholastic aptitude test score in mathematics and this same student's
grade in calculus. Frequently, one of these variables, say $x$, is known in advance
of the other and there is interest in predicting a future random variable $Y$. Since
$Y$ is a random variable, we cannot predict its future observed value $Y = y$ with
certainty. Thus let us first concentrate on the problem of estimating the mean
of $Y$, that is, $E(Y)$. Now $E(Y)$ is usually a function of $x$; for example, in our
illustration with the calculus grade, say $Y$, we would expect $E(Y)$ to increase with
increasing mathematics aptitude score $x$. Sometimes $E(Y) = \mu(x)$ is assumed to
be of a given form, such as a linear or quadratic or exponential function; that is,
$\mu(x)$ could be assumed to be equal to $\alpha + \beta x$ or $\alpha + \beta x + \gamma x^2$ or $\alpha e^{\beta x}$. To estimate
$E(Y) = \mu(x)$, or equivalently the parameters $\alpha$, $\beta$, and $\gamma$, we observe the random
variable $Y$ for each of $n$ possible different values of $x$, say $x_1, x_2, \ldots, x_n$, which are
not all equal. Once the $n$ independent experiments have been performed, we have
$n$ pairs of known numbers $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. These pairs are then used
to estimate the mean $E(Y)$. Problems like this are often classified under regression
because $E(Y) = \mu(x)$ is frequently called a regression curve.


Remark 9.6.1. A model for the mean such as $\alpha + \beta x + \gamma x^2$ is called a linear
model because it is linear in the parameters $\alpha$, $\beta$, and $\gamma$. Thus $\alpha e^{\beta x}$ is not a linear
model because it is not linear in $\alpha$ and $\beta$. Note that, in Sections 9.2 to 9.5, all the
means were linear in the parameters and hence are linear models. ■


For the most part in this section, we consider the case in which $E(Y) = \mu(x)$ is
a linear function. Denote by $Y_i$ the response at $x_i$ and consider the model

$$Y_i = \alpha + \beta(x_i - \bar{x}) + e_i, \quad i = 1, \ldots, n, \tag{9.6.1}$$

where $\bar{x} = n^{-1}\sum_{i=1}^{n} x_i$ and $e_1, \ldots, e_n$ are iid random variables with a common
$N(0, \sigma^2)$ distribution. Hence $E(Y_i) = \alpha + \beta(x_i - \bar{x})$, $\text{Var}(Y_i) = \sigma^2$, and $Y_i$ has a
$N(\alpha + \beta(x_i - \bar{x}), \sigma^2)$ distribution. The major assumption is that the random errors,
$e_i$, are iid. In particular, this means that the errors are not a function of the
$x_i$'s. This is discussed in Remark 9.6.3. First, we discuss the maximum likelihood
estimates of the parameters $\alpha$, $\beta$, and $\sigma$.


9.6.1 Maximum Likelihood Estimates 


Assume that the $n$ points $(x_1, Y_1), (x_2, Y_2), \ldots, (x_n, Y_n)$ follow Model 9.6.1. So the
first problem is that of fitting a straight line to the set of points; i.e., estimating
$\alpha$ and $\beta$. As an aid to our discussion, Figure 9.6.1 shows a scatterplot of 60
observations $(x_1, y_1), \ldots, (x_{60}, y_{60})$ simulated from a linear model of the form (9.6.1).
Our method of estimation in this section is that of maximum likelihood (mle). The
joint pdf of $Y_1, \ldots, Y_n$ is the product of the individual probability density functions;
that is, the likelihood function equals

$$L(\alpha, \beta, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{[y_i - \alpha - \beta(x_i - \bar{x})]^2}{2\sigma^2}\right\} = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n} [y_i - \alpha - \beta(x_i - \bar{x})]^2\right\}.$$

[Figure 9.6.1: The plot shows the least squares fitted line (solid line) to a set of
data. The dashed-line segment from $(x_i, \hat{y}_i)$ to $(x_i, y_i)$ shows the deviation of $(x_i, y_i)$
from its fit.]

To maximize $L(\alpha, \beta, \sigma^2)$, or, equivalently, to minimize

$$-\log L(\alpha, \beta, \sigma^2) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{\sum_{i=1}^{n} [y_i - \alpha - \beta(x_i - \bar{x})]^2}{2\sigma^2},$$


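we must select $\alpha$ and $\beta$ to minimize

$$H(\alpha, \beta) = \sum_{i=1}^{n} [y_i - \alpha - \beta(x_i - \bar{x})]^2.$$

Since $|y_i - \alpha - \beta(x_i - \bar{x})| = |y_i - \mu(x_i)|$ is the vertical distance from the point
$(x_i, y_i)$ to the line $y = \mu(x)$ (see the dashed-line segment in Figure 9.6.1), we note
that $H(\alpha, \beta)$ represents the sum of the squares of those distances. Thus, selecting
$\alpha$ and $\beta$ so that the sum of the squares is minimized means that we are fitting the
straight line to the data by the method of least squares (LS).

To minimize $H(\alpha, \beta)$, we find the two first partial derivatives,

$$\frac{\partial H(\alpha, \beta)}{\partial \alpha} = 2\sum_{i=1}^{n} [y_i - \alpha - \beta(x_i - \bar{x})](-1)$$

and

$$\frac{\partial H(\alpha, \beta)}{\partial \beta} = 2\sum_{i=1}^{n} [y_i - \alpha - \beta(x_i - \bar{x})][-(x_i - \bar{x})].$$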
Setting $\partial H(\alpha, \beta)/\partial \alpha = 0$, we obtain

$$\sum_{i=1}^{n} y_i - n\alpha - \beta\sum_{i=1}^{n} (x_i - \bar{x}) = 0. \tag{9.6.2}$$

Since $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$, the equation becomes $\sum_{i=1}^{n} y_i - n\alpha = 0$; hence, the mle of
$\alpha$ is

$$\hat{\alpha} = \bar{Y}. \tag{9.6.3}$$

The equation $\partial H(\alpha, \beta)/\partial \beta = 0$ yields, with $\alpha$ replaced by $\bar{y}$,

$$\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) - \beta\sum_{i=1}^{n} (x_i - \bar{x})^2 = 0 \tag{9.6.4}$$

and, hence, the mle of $\beta$ is

$$\hat{\beta} = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} Y_i(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}. \tag{9.6.5}$$

Equations (9.6.2) and (9.6.4) are the estimating equations for the LS solutions for
this simple linear model.
The fitted value at the point $(x_i, y_i)$ is given by

$$\hat{y}_i = \hat{\alpha} + \hat{\beta}(x_i - \bar{x}), \tag{9.6.6}$$

which is shown on Figure 9.6.1. The fitted value $\hat{y}_i$ is also called the predicted
value of $y_i$ at $x_i$. The residual at the point $(x_i, y_i)$ is given by

$$\hat{e}_i = y_i - \hat{y}_i, \tag{9.6.7}$$

which is also shown on Figure 9.6.1. Residual means "what is left" and the residual
in regression is exactly that, i.e., what is left over after the fit. The relationship
between the fitted values and the residuals is explored in Remark 9.6.3 and in
Exercise 9.6.13.
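To find the maximum likelihood estimator of $\sigma^2$, consider the partial derivative

$$\frac{\partial[-\log L(\alpha, \beta, \sigma^2)]}{\partial(\sigma^2)} = \frac{n}{2\sigma^2} - \frac{\sum_{i=1}^{n} [y_i - \alpha - \beta(x_i - \bar{x})]^2}{2(\sigma^2)^2}.$$

Setting this equal to zero and replacing $\alpha$ and $\beta$ by their solutions $\hat{\alpha}$ and $\hat{\beta}$, we
obtain

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} [Y_i - \hat{\alpha} - \hat{\beta}(x_i - \bar{x})]^2. \tag{9.6.8}$$

Of course, due to the invariance of mles, $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$. Note that in terms of the
residuals, $\hat{\sigma}^2 = n^{-1}\sum_{i=1}^{n} \hat{e}_i^2$. As shown in Exercise 9.6.13, the average of the residuals
is 0.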
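The closed-form estimates (9.6.3), (9.6.5), and (9.6.8) can be checked numerically against R's `lm` function. The sketch below uses simulated data; the sample size, true parameter values, and seed are arbitrary choices of ours, made only for illustration.

```r
set.seed(1)                        # simulated data, for illustration only
n <- 60
x <- runif(n, 0, 10)
y <- 2 + 0.5 * (x - mean(x)) + rnorm(n, sd = 1)

xc <- x - mean(x)                  # centered predictor x_i - xbar

alpha_hat  <- mean(y)                       # (9.6.3)
beta_hat   <- sum(y * xc) / sum(xc^2)       # (9.6.5)
yhat       <- alpha_hat + beta_hat * xc     # fitted values (9.6.6)
ehat       <- y - yhat                      # residuals (9.6.7)
sigma2_hat <- mean(ehat^2)                  # mle of sigma^2 (9.6.8), divisor n

# Cross-check with lm() on the centered model
fit <- lm(y ~ xc)
rbind(closed_form = c(alpha_hat, beta_hat), lm = unname(coef(fit)))
mean(ehat)                         # zero up to rounding error
```

Note that `summary(fit)$sigma^2` uses the divisor $n - 2$ rather than $n$, so it equals $n\hat{\sigma}^2/(n - 2)$, not the mle itself.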
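Since $\hat{\alpha}$ is a linear function of independent and normally distributed random
variables, $\hat{\alpha}$ has a normal distribution with mean

$$E(\hat{\alpha}) = E\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(Y_i) = \frac{1}{n}\sum_{i=1}^{n} [\alpha + \beta(x_i - \bar{x})] = \alpha$$

and variance

$$\text{var}(\hat{\alpha}) = \sum_{i=1}^{n} \left(\frac{1}{n}\right)^2 \text{var}(Y_i) = \frac{\sigma^2}{n}.$$

The estimator $\hat{\beta}$ is also a linear function of $Y_1, Y_2, \ldots, Y_n$ and hence has a normal
distribution with mean

$$E(\hat{\beta}) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})[\alpha + \beta(x_i - \bar{x})]}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \beta$$

and variance

$$\text{var}(\hat{\beta}) = \sum_{i=1}^{n} \left[\frac{x_i - \bar{x}}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right]^2 \text{var}(Y_i) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]^2}\,\sigma^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$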

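In summary, the estimators $\hat{\alpha}$ and $\hat{\beta}$ are linear functions of the independent
normal random variables $Y_1, \ldots, Y_n$. In Exercise 9.6.4 it is further shown that the
covariance between $\hat{\alpha}$ and $\hat{\beta}$ is zero. It follows that $\hat{\alpha}$ and $\hat{\beta}$ are independent random
variables with a bivariate normal distribution; that is,

$$\begin{pmatrix} \hat{\alpha} \\ \hat{\beta} \end{pmatrix} \text{ has a } N_2\left(\begin{pmatrix} \alpha \\ \beta \end{pmatrix}, \ \sigma^2 \begin{bmatrix} \frac{1}{n} & 0 \\ 0 & \frac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \end{bmatrix}\right) \text{ distribution.} \tag{9.6.9}$$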


Next, we consider the estimator of $\sigma^2$. It can be shown (Exercise 9.6.9) that

$$\begin{aligned}
\sum_{i=1}^{n} [Y_i - \alpha - \beta(x_i - \bar{x})]^2 &= \sum_{i=1}^{n} \{(\hat{\alpha} - \alpha) + (\hat{\beta} - \beta)(x_i - \bar{x}) + [Y_i - \hat{\alpha} - \hat{\beta}(x_i - \bar{x})]\}^2 \\
&= n(\hat{\alpha} - \alpha)^2 + (\hat{\beta} - \beta)^2\sum_{i=1}^{n} (x_i - \bar{x})^2 + n\hat{\sigma}^2,
\end{aligned}$$

or for brevity,

$$Q = Q_1 + Q_2 + Q_3.$$

Here $Q$, $Q_1$, $Q_2$, and $Q_3$ are real quadratic forms in the variables

$$Y_i - \alpha - \beta(x_i - \bar{x}), \quad i = 1, 2, \ldots, n.$$

In this equation, $Q$ represents the sum of the squares of $n$ independent random
variables that have normal distributions with means zero and variances $\sigma^2$. Thus
$Q/\sigma^2$ has a $\chi^2$ distribution with $n$ degrees of freedom. Each of the random variables
$\sqrt{n}(\hat{\alpha} - \alpha)/\sigma$ and $\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,(\hat{\beta} - \beta)/\sigma$ has a normal distribution with zero
mean and unit variance; thus, each of $Q_1/\sigma^2$ and $Q_2/\sigma^2$ has a $\chi^2$ distribution with
1 degree of freedom. In accordance with Theorem 9.9.2 (proved in Section 9.9),
because $Q_3$ is nonnegative, we have that $Q_1$, $Q_2$, and $Q_3$ are independent and that
$Q_3/\sigma^2$ has a $\chi^2$ distribution with $n - 1 - 1 = n - 2$ degrees of freedom. That is,
$n\hat{\sigma}^2/\sigma^2$ has a $\chi^2$ distribution with $n - 2$ degrees of freedom.

We now extend this discussion to obtain inference for the parameters $\alpha$ and $\beta$.
It follows from the above derivations that both the random variable $T_1$

$$T_1 = \frac{[\sqrt{n}(\hat{\alpha} - \alpha)]/\sigma}{\sqrt{Q_3/[\sigma^2(n - 2)]}} = \frac{\hat{\alpha} - \alpha}{\sqrt{\hat{\sigma}^2/(n - 2)}}$$

and the random variable $T_2$

$$T_2 = \frac{\left[\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,(\hat{\beta} - \beta)\right]/\sigma}{\sqrt{Q_3/[\sigma^2(n - 2)]}} = \frac{\hat{\beta} - \beta}{\sqrt{n\hat{\sigma}^2/[(n - 2)\sum_{i=1}^{n} (x_i - \bar{x})^2]}} \tag{9.6.10}$$

have a $t$-distribution with $n - 2$ degrees of freedom. These facts enable us to obtain
confidence intervals for $\alpha$ and $\beta$; see Exercise 9.6.5. The fact that $n\hat{\sigma}^2/\sigma^2$ has a
$\chi^2$ distribution with $n - 2$ degrees of freedom provides a means of determining a
confidence interval for $\sigma^2$. These are some of the statistical inferences about the
parameters to which reference was made in the introductory remarks of this section.
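As an illustration of how the pivots $T_1$ and $T_2$ are used, the following R sketch forms 95% intervals from their denominators and compares them with `confint()` applied to the centered fit. The data are simulated, and the sample size, true parameter values, and seed are arbitrary assumptions of ours.

```r
set.seed(2)                        # simulated data, for illustration only
n <- 30
x <- seq(1, 30)
y <- 5 - 0.2 * (x - mean(x)) + rnorm(n)

xc  <- x - mean(x)
Sxx <- sum(xc^2)
alpha_hat  <- mean(y)
beta_hat   <- sum(y * xc) / Sxx
sigma2_hat <- mean((y - alpha_hat - beta_hat * xc)^2)   # mle, divisor n

tcrit    <- qt(0.975, n - 2)
se_alpha <- sqrt(sigma2_hat / (n - 2))                  # denominator of T1
se_beta  <- sqrt(n * sigma2_hat / ((n - 2) * Sxx))      # denominator of T2
ci_alpha <- alpha_hat + c(-1, 1) * tcrit * se_alpha
ci_beta  <- beta_hat  + c(-1, 1) * tcrit * se_beta

# Agreement with confint() for the centered model
ci_lm <- confint(lm(y ~ xc), level = 0.95)
rbind(alpha = ci_alpha, beta = ci_beta)
ci_lm
```

The two sets of intervals coincide because the standard errors reported by `lm` are exactly the denominators of $T_1$ and $T_2$.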


Remark 9.6.2. The more discerning reader should quite properly question our
construction of $T_1$ and $T_2$ immediately above. We know that the squares of the
linear forms are independent of $Q_3 = n\hat{\sigma}^2$, but we do not know, at this time,
that the linear forms themselves enjoy this independence. A more general result is
obtained in Theorem 9.9.1 of Section 9.9 and the present case is a special instance. ■


Before considering a numerical example, we discuss a diagnostic plot for the 
major assumption of Model 9.6.1. 


Remark 9.6.3 (Diagnostic Plot Based on Fitted Values and Residuals). The major
assumption in the model is that the random errors $e_1, \ldots, e_n$ are iid. In particular,
this means that the errors are not a function of the $x_i$'s so that a plot of $e_i$ versus
$\alpha + \beta(x_i - \bar{x})$ should result in a random scatter. Since the errors and the parameters are
unknown, this plot is not possible. We have estimates, though, of these quantities,
namely the residuals $\hat{e}_i$ and the fitted values $\hat{y}_i$. A diagnostic for the assumption is
to plot the residuals versus the fitted values. This is called the residual plot. If the
plot results in a random scatter, it is an indication that the model is appropriate.
Patterns in the plot, though, are indicative of a poor model. Often in this latter
case, the patterns in the plot lead to better models.
As a final note, in Model 9.6.1 we have centered the $x$'s; i.e., subtracted $\bar{x}$ from
$x_i$. In practice, usually we do not precenter the $x$'s. Instead, we fit the model
$y_i = \alpha^* + \beta x_i + e_i$. In this case, the least squares estimates, and hence the mles, minimize the
sum of squares

$$\sum_{i=1}^{n} (y_i - \alpha^* - \beta x_i)^2. \tag{9.6.11}$$

In Exercise 9.6.1, the reader is asked to show that the estimate of $\beta$ remains the
same as in expression (9.6.5), while $\hat{\alpha}^* = \bar{y} - \hat{\beta}\bar{x}$. We use this noncentered model
in the following example.


Example 9.6.1 (Men’s 1500 meters). As a numerical illustration, consider data 
drawn from the Olympics. The response of interest is the winning time of the men’s 
1500 meters, while the predictor is the year of the Olympics. The data were taken
from Wikipedia and can be found in olym1500mara.rda. Assume the R vectors 
for the winning times and year are time and year, respectively. There are n = 27 
data points. The top panel of Figure 9.6.2 shows a scatterplot of the data that is 
computed by the R command 

par (mfrow=c(2,1));plot(time~year,xlab="Year",ylab="Winning time") 
The winning times are steadily decreasing over time and, based on this plot, a sim- 
ple linear model seems reasonable. Obviously the time for 2016 is an outlier but it 
is the correct time. Before proceeding to inference, though, we check the quality 
of the fit of the model. The following R commands obtain the least squares fit, 
overlaying it on the scatterplot in Figure 9.6.2, the fitted values, and the residuals. 
These are used to obtain the residual plot that is displayed in the bottom panel of
Figure 9.6.2.

fit <- lm(time~year); abline(fit)

ehat <- fit$resid; yhat <- fit$fitted.values

plot(ehat~yhat,xlab="Fitted values",ylab="Residuals")
Recall a “good” fit is indicated by a random scatter in the residual plot. This does 
not appear to be the case. There is a dependence* between adjacent points over 
time. This dependence is apparent from the scatterplot too. In a time series course, 
this dependence would be investigated. 

Based on the dependence, the following inference is approximate. The command 
summary (fit) produces the table of coefficients: 
             Estimate Std. Error t value Pr(>|t|)

(Intercept) 12.325411 1.039402 11.858 9.26e-12 

year -0.004376 0.000530 -8.257 1.31e-08 
Hence, the prediction equation is $\hat{y} = 12.33 - 0.0044\,\text{year}$. Based on the slope estimate,
we predict the winning time to drop by about 0.004 minutes every year. For a 95%
confidence interval for the slope, the $t$-critical value via R is qt(.975,25), which
computes to 2.060. Using the standard error in the summary table, the following R
commands compute the confidence interval for the slope parameter:

err=0.000530*2.060; lb=-0.004376-err; ub=-0.004376+err; ci=c(lb, ub)
ci; -0.0054678 -0.0032842
So with approximate confidence 95%, we estimate the drop in winning time to be
between 0.0033 and 0.0055 minutes per year.

* This dependence is not surprising. The runners race against each other, but they also try to
beat the Olympic record.


Based on the fit, the predicted winning time for the men's 1500 meters in the
2020 Olympics is

$$\hat{y} = 12.325411 - 0.004376(2020) = 3.486. \tag{9.6.12}$$

Exercise 9.6.8 provides an estimate (predictive interval) of error for this prediction. ■


[Figure 9.6.2: The top panel is the scatterplot of winning times in the men's 1500
meters versus the year of the Olympics (vertical axis: Winning time). The least squares fit is overlaid. The
bottom panel is the residual plot of the fit (Residuals versus Fitted values).]


9.6.2 *Geometry of the Least Squares Fit 


In the modern literature, linear models are usually expressed in terms of matrices
and vectors, which we briefly introduce in this example. Furthermore, this allows
us to discuss the simple geometry behind the least squares fit. Consider then Model
(9.6.1). Write the vectors $\mathbf{Y} = (Y_1,\ldots,Y_n)'$, $\mathbf{e} = (e_1,\ldots,e_n)'$, and
$\mathbf{x}_c = (x_1 - \bar{x},\ldots,x_n - \bar{x})'$. Let $\mathbf{1}$ denote the $n \times 1$ vector whose
components are all 1. Then Model (9.6.1) can be expressed equivalently as

$$\mathbf{Y} = \alpha\mathbf{1} + \beta\mathbf{x}_c + \mathbf{e}
= [\mathbf{1}\ \ \mathbf{x}_c]\begin{pmatrix}\alpha \\ \beta\end{pmatrix} + \mathbf{e}
= \mathbf{X}\boldsymbol{\beta} + \mathbf{e}, \qquad (9.6.13)$$



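where $\mathbf{X}$ is the $n \times 2$ matrix with columns $\mathbf{1}$ and $\mathbf{x}_c$ and $\boldsymbol{\beta} = (\alpha,\beta)'$. Next, let
$\boldsymbol{\theta} = E(\mathbf{Y}) = \mathbf{X}\boldsymbol{\beta}$. Finally, let $V$ be the two-dimensional subspace of $R^n$ spanned by
the columns of $\mathbf{X}$; i.e., $V$ is the range of the matrix $\mathbf{X}$. Hence we can also express
the model succinctly as

$$\mathbf{Y} = \boldsymbol{\theta} + \mathbf{e}, \quad \boldsymbol{\theta} \in V. \qquad (9.6.14)$$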


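Hence, except for the random error vector $\mathbf{e}$, $\mathbf{Y}$ would lie in $V$. It makes sense
intuitively then, as suggested by Figure 9.6.3, to estimate $\boldsymbol{\theta}$ by the vector in $V$ that
is "closest" (in Euclidean distance) to $\mathbf{Y}$, that is, by $\hat{\boldsymbol{\theta}}$, where

$$\hat{\boldsymbol{\theta}} = \mathrm{Argmin}_{\boldsymbol{\theta} \in V}\, \|\mathbf{Y} - \boldsymbol{\theta}\|^2, \qquad (9.6.15)$$

where the square of the Euclidean norm is given by $\|\mathbf{u}\|^2 = \sum_{i=1}^n u_i^2$, for $\mathbf{u} \in R^n$.
As shown in Exercise 9.6.13 and depicted on the plot in Figure 9.6.3, $\hat{\boldsymbol{\theta}} = \hat{\alpha}\mathbf{1} + \hat{\beta}\mathbf{x}_c$,
where $\hat{\alpha}$ and $\hat{\beta}$ are the least squares estimates given above. Also, the vector $\hat{\mathbf{e}} =
\mathbf{Y} - \hat{\boldsymbol{\theta}}$ is the vector of residuals and $n\hat{\sigma}^2 = \|\hat{\mathbf{e}}\|^2$. Also, just as depicted in Figure
9.6.3, the angle between the vectors $\hat{\boldsymbol{\theta}}$ and $\hat{\mathbf{e}}$ is a right angle. In linear models, we
say that $\hat{\boldsymbol{\theta}}$ is the projection of $\mathbf{Y}$ onto the subspace $V$.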
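
The projection described in this section can be checked numerically. The following is a minimal R sketch, not part of the text's own code; the vectors x and y are hypothetical values chosen only for illustration. It forms the design matrix with columns 1 and x_c, projects y onto its column space, and confirms that the fit and the residual vector are orthogonal.

```r
# Hypothetical data, used only to illustrate the geometry
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 2.9, 4.2, 4.8, 6.1, 6.8)
n <- length(y)

one <- rep(1, n)          # the vector 1
xc  <- x - mean(x)        # centered predictor x_c
X   <- cbind(one, xc)     # n x 2 design matrix

# Projection matrix onto V, the column space of X
H <- X %*% solve(t(X) %*% X) %*% t(X)
thetahat <- as.vector(H %*% y)   # projection of y onto V
ehat     <- y - thetahat         # residual vector

# The fitted values from lm() agree with the projection
fit <- lm(y ~ xc)
max(abs(thetahat - fitted(fit)))

# Orthogonality: the inner product of fit and residuals is numerically zero
sum(thetahat * ehat)
# The residuals sum to zero, i.e., 1'ehat = 0
sum(ehat)
```

Because the predictor is centered, the intercept estimate returned by lm is the sample mean of y, in line with the least squares estimates of this section.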


Figure 9.6.3: The sketch shows the geometry of least squares. The vector of
responses is $\mathbf{Y}$, the fit is $\hat{\boldsymbol{\theta}}$, and the vector of residuals is $\hat{\mathbf{e}}$.

EXERCISES


9.6.1. Obtain the least squares estimates for the model $y_i = \alpha^* + \beta x_i + e_i$ by
minimizing the sum of squares given in expression (9.6.11). Determine the distribution
of $\hat{\alpha}^*$.


9.6.2. Students’ scores on the mathematics portion of the ACT examination, x,
and on the final examination in first-semester calculus (200 points possible), y,
are in the rda file regr1.rda. [The printed table of (x, y) pairs is illegible in this copy.]
Use R or another statistical package for computation and plotting.


(a) Calculate the least squares regression line for these data. 
(b) Plot the points and the least squares regression line on the same graph. 
(c) Obtain the residual plot and comment on the appropriateness of the model. 


(d) Find a 95% confidence interval for $\beta$ under the usual assumptions. Comment
in terms of the problem.


9.6.3 (Telephone Data). Consider the data presented below. The responses (y) for 
this data set are the numbers of telephone calls (tens of millions) made in Belgium 
for the years 1950 through 1973. Time, the years, serves as the predictor variable 
(x). The data are discussed on page 172 of Hettmansperger and McKean (2011) 
and are in the file telephone.rda. 


Year      |    50     51     52     53     54     55
No. Calls |  0.44   0.47   0.47   0.59   0.66   0.73
Year      |    56     57     58     59     60     61
No. Calls |  0.81   0.88   1.06   1.20   1.35   1.49
Year      |    62     63     64     65     66     67
No. Calls |  1.61   2.12  11.90  12.40  14.20  15.90
Year      |    68     69     70     71     72     73
No. Calls | 18.20  21.20   4.30   2.40   2.70   2.90

(a) Calculate the least squares regression line for these data. 


(b) Plot the points and the least squares regression line on the same graph. 


(c) What is the reason for the poor least squares fit? 


9.6.4. Show that the covariance between $\hat{\alpha}$ and $\hat{\beta}$ is zero.


9.6.5. Find $(1 - \alpha)100\%$ confidence intervals for the parameters $\alpha$ and $\beta$ in Model
(9.6.1).

9.6.6. Consider Model (9.6.1). Let $\eta_0 = E(Y \mid x = x_0 - \bar{x})$. The least squares
estimator of $\eta_0$ is $\hat{\eta}_0 = \hat{\alpha} + \hat{\beta}(x_0 - \bar{x})$.

(a) Using (9.6.9), show that $\hat{\eta}_0$ is an unbiased estimator and show that its variance
is given by

$$V(\hat{\eta}_0) = \sigma^2\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right].$$

(b) Obtain the distribution of $\hat{\eta}_0$ and use it to determine a $(1-\alpha)100\%$ confidence
interval for $\eta_0$.


9.6.7. Assume that the sample $(x_1, Y_1),\ldots,(x_n, Y_n)$ follows the linear model (9.6.1).
Suppose $Y_0$ is a future observation at $x = x_0 - \bar{x}$ and we want to determine a
predictive interval for it. Assume that the model (9.6.1) holds for $Y_0$; i.e., $Y_0$ has a
$N(\alpha + \beta(x_0 - \bar{x}), \sigma^2)$ distribution. We use $\hat{\eta}_0$ of Exercise 9.6.6 as our prediction of
$Y_0$.

(a) Obtain the distribution of $Y_0 - \hat{\eta}_0$, showing that its variance is:

$$V(Y_0 - \hat{\eta}_0) = \sigma^2\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right].$$

Use the fact that the future observation $Y_0$ is independent of the sample
$(x_1, Y_1),\ldots,(x_n, Y_n)$.

(b) Determine a t-statistic with numerator $Y_0 - \hat{\eta}_0$.

(c) Now beginning with $1 - \alpha = P[-t_{\alpha/2,n-2} < T < t_{\alpha/2,n-2}]$, where $0 < \alpha < 1$,
determine a $(1 - \alpha)100\%$ predictive interval for $Y_0$.

(d) Compare this predictive interval with the confidence interval obtained in
Exercise 9.6.6. Intuitively, why is the predictive interval larger?


9.6.8. In Example 9.6.1, we obtain the predicted winning time for the men’s 1500
meters in the 2020 Olympics. Compute the 95% predictive interval for this
prediction, using the interval derived in the last exercise. These computations are
performed by the R function cipi.R. The call is cipi(lm(time~year), matrix(c(1,2020), ncol=2)).
In terms of the problem, what does this predictive interval mean? Next compute
the prediction for the 2024 and 2028 Olympics. Why are the intervals increasing in
length?

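9.6.9. Show that

$$\sum_{i=1}^n [Y_i - \alpha - \beta(x_i - \bar{x})]^2 = n(\hat{\alpha} - \alpha)^2 + (\hat{\beta} - \beta)^2 \sum_{i=1}^n (x_i - \bar{x})^2 + \sum_{i=1}^n [Y_i - \hat{\alpha} - \hat{\beta}(x_i - \bar{x})]^2.$$
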
9.6.10. Let the independent random variables $Y_1, Y_2,\ldots,Y_n$ have, respectively, the
probability density functions $N(\beta x_i, \gamma^2 x_i^2)$, $i = 1,2,\ldots,n$, where the given numbers
$x_1, x_2,\ldots,x_n$ are not all equal and no one is zero. Find the maximum likelihood
estimators of $\beta$ and $\gamma^2$.

9.6.11. Let the independent random variables $Y_1,\ldots,Y_n$ have the joint pdf

$$L(\alpha, \beta, \sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n [y_i - \alpha - \beta(x_i - \bar{x})]^2\right\},$$

where the given numbers $x_1, x_2,\ldots,x_n$ are not all equal. Let $H_0 : \beta = 0$ ($\alpha$ and
$\sigma^2$ unspecified). It is desired to use a likelihood ratio test to test $H_0$ against all
possible alternatives. Find $\Lambda$ and see whether the test can be based on a familiar
statistic.

Hint: In the notation of this section, show that

$$\sum_{i=1}^n (Y_i - \hat{\alpha})^2 = Q_3 + \hat{\beta}^2 \sum_{i=1}^n (x_i - \bar{x})^2.$$


9.6.12. Using the notation of Section 9.2, assume that the means $\mu_j$ satisfy a linear
function of $j$, namely, $\mu_j = c + d[j - (b+1)/2]$. Let independent random samples
of size $a$ be taken from the $b$ normal distributions having means $\mu_1, \mu_2,\ldots,\mu_b$,
respectively, and common unknown variance $\sigma^2$.

(a) Show that the maximum likelihood estimators of $c$ and $d$ are, respectively,
$\hat{c} = \bar{X}_{..}$ and

$$\hat{d} = \frac{\sum_{j=1}^b [j - (b+1)/2](\bar{X}_{.j} - \bar{X}_{..})}{\sum_{j=1}^b [j - (b+1)/2]^2}.$$

(b) Show that

$$\sum_{i=1}^a \sum_{j=1}^b (X_{ij} - \bar{X}_{..})^2 = \sum_{i=1}^a \sum_{j=1}^b \left[X_{ij} - \bar{X}_{..} - \hat{d}\left(j - \frac{b+1}{2}\right)\right]^2 + \hat{d}^2 \sum_{j=1}^b a\left(j - \frac{b+1}{2}\right)^2.$$

(c) Argue that the two terms in the right-hand member of part (b), once divided
by $\sigma^2$, are independent random variables with $\chi^2$ distributions provided that
$d = 0$.

(d) What F-statistic would be used to test the equality of the means, that is,
$H_0 : d = 0$?


9.6.13. Consider the discussion in Section 9.6.2.

(a) Show that $\hat{\boldsymbol{\theta}} = \hat{\alpha}\mathbf{1} + \hat{\beta}\mathbf{x}_c$, where $\hat{\alpha}$ and $\hat{\beta}$ are the least squares estimators
derived in this section.

(b) Show that the vector $\hat{\mathbf{e}} = \mathbf{Y} - \hat{\boldsymbol{\theta}}$ is the vector of residuals; i.e., its $i$th entry is
$\hat{e}_i$, (9.6.7).

(c) As depicted in Figure 9.6.3, show that the angle between the vectors $\hat{\boldsymbol{\theta}}$ and $\hat{\mathbf{e}}$
is a right angle.

(d) Show that the residuals sum to zero; i.e., $\mathbf{1}'\hat{\mathbf{e}} = 0$.


9.6.14. Fit $y = a + x$ to the data

[The printed table of x and y values is illegible in this copy.]

by the method of least squares.


9.6.15. Fit by the method of least squares the plane $z = a + bx + cy$ to the five
points $(x, y, z)$ : $(-1, -2, 5)$, $(0, -2, 4)$, $(0, 0, 4)$, $(1, 0, 2)$, $(2, 1, 0)$.

Let the R vectors x, y, z contain the values for x, y, and z. Then the LS fit is
computed by lm(z ~ x + y).


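9.6.16. Let the $4 \times 1$ matrix $\mathbf{Y}$ be multivariate normal $N(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$, where the
$4 \times 3$ matrix $\mathbf{X}$ equals

$$\mathbf{X} = \begin{pmatrix} 1 & 1 & 2 \\ 1 & -1 & 2 \\ 1 & 0 & -3 \\ 1 & 0 & -1 \end{pmatrix}$$

and $\boldsymbol{\beta}$ is the $3 \times 1$ regression coefficient matrix.

(a) Find the mean matrix and the covariance matrix of $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$.

(b) If we observe $\mathbf{Y}'$ to be equal to $(6, 1, 11, 3)$, compute $\hat{\boldsymbol{\beta}}$.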


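9.6.17. Suppose $\mathbf{Y}$ is an $n \times 1$ random vector, $\mathbf{X}$ is an $n \times p$ matrix of known
constants of rank $p$, and $\boldsymbol{\beta}$ is a $p \times 1$ vector of regression coefficients. Let $\mathbf{Y}$ have a
$N(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$ distribution. Obtain the pdf of $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$.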


9.6.18. Let the independent normal random variables $Y_1, Y_2,\ldots,Y_n$ have, respectively,
the probability density functions $N(\mu, \gamma^2 x_i^2)$, $i = 1,2,\ldots,n$, where the given
$x_1, x_2,\ldots,x_n$ are not all equal and no one of which is zero. Discuss the test of
the hypothesis $H_0 : \gamma = 1$, $\mu$ unspecified, against all alternatives $H_1 : \gamma \neq 1$, $\mu$
unspecified.


9.7 A Test of Independence


Let $X$ and $Y$ have a bivariate normal distribution with means $\mu_1$ and $\mu_2$, positive
variances $\sigma_1^2$ and $\sigma_2^2$, and correlation coefficient $\rho$. We wish to test the
hypothesis that $X$ and $Y$ are independent. Because two jointly normally distributed
random variables are independent if and only if $\rho = 0$, we test the hypothesis
$H_0 : \rho = 0$ against the hypothesis $H_1 : \rho \neq 0$. A likelihood ratio test is used.

Let $(X_1, Y_1), (X_2, Y_2),\ldots,(X_n, Y_n)$ denote a random sample of size $n > 2$ from the
bivariate normal distribution; that is, the joint pdf of these $2n$ random variables is
given by

$$f(x_1, y_1) f(x_2, y_2) \cdots f(x_n, y_n).$$

Although it is fairly difficult to show, the likelihood ratio $\Lambda$ is a function of the
statistic that is the mle of $\rho$, namely,

$$R = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^n (X_i - \bar{X})^2 \sum_{i=1}^n (Y_i - \bar{Y})^2}}. \qquad (9.7.1)$$

This statistic $R$ is called the sample correlation coefficient of the random
sample. Following the discussion after expression (5.4.5), the statistic $R$ is a consistent
estimate of $\rho$; see Exercise 9.7.5. The likelihood ratio principle, which calls for the
rejection of $H_0$ if $\Lambda \leq \lambda_0$, is equivalent to the computed value of $|R| \geq c$. That
is, if the absolute value of the correlation coefficient of the sample is too large, we
reject the hypothesis that the correlation coefficient of the distribution is equal to
zero. To determine a value of $c$ for a satisfactory significance level, it is necessary
to obtain the distribution of $R$, or a function of $R$, when $H_0$ is true, as we outline
next.

Let $X_1 = x_1, X_2 = x_2,\ldots,X_n = x_n$, $n > 2$, where $x_1, x_2,\ldots,x_n$ and $\bar{x} =
\sum_{i=1}^n x_i/n$ are fixed numbers such that $\sum_{i=1}^n (x_i - \bar{x})^2 > 0$. Consider the conditional pdf
of $Y_1, Y_2,\ldots,Y_n$ given that $X_1 = x_1, X_2 = x_2,\ldots,X_n = x_n$. Because $Y_1, Y_2,\ldots,Y_n$
are independent and, with $\rho = 0$, are also independent of $X_1, X_2,\ldots,X_n$, this
conditional pdf is given by

$$\left(\frac{1}{\sqrt{2\pi}\,\sigma_2}\right)^n \exp\left\{-\frac{1}{2\sigma_2^2}\sum_{i=1}^n (y_i - \mu_2)^2\right\}.$$

Let $R_c$ be the correlation coefficient, given $X_1 = x_1, X_2 = x_2,\ldots,X_n = x_n$, so that

$$\frac{R_c\sqrt{\sum_{i=1}^n (Y_i - \bar{Y})^2}}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}} = \frac{\sum_{i=1}^n (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \qquad (9.7.2)$$

is $\hat{\beta}$, expression (9.6.5) of Section 9.6. Conditionally the mean of $Y_i$ is $\mu_2$; i.e., a
constant. So here expression (9.7.2) has expectation 0, which implies that $E(R_c) = 0$.
Next consider the t-ratio of $\hat{\beta}$ given by $T_2$ of expression (9.6.10) of Section 9.6. In
this notation $T_2$ can be expressed as

$$T_2 = \frac{R_c\sqrt{\sum_i (Y_i - \bar{Y})^2}\Big/\sqrt{\sum_i (x_i - \bar{x})^2}}{\sqrt{\dfrac{1}{n-2}\sum_{i=1}^n\left[Y_i - \bar{Y} - \dfrac{R_c\sqrt{\sum_j (Y_j - \bar{Y})^2}}{\sqrt{\sum_j (x_j - \bar{x})^2}}(x_i - \bar{x})\right]^2 \Big/ \sum_{j=1}^n (x_j - \bar{x})^2}} = \frac{R_c\sqrt{n-2}}{\sqrt{1 - R_c^2}}. \qquad (9.7.3)$$

Thus $T_2$, given $X_1 = x_1,\ldots,X_n = x_n$, has a conditional t-distribution with $n - 2$
degrees of freedom. Note that the pdf, say $g(t)$, of this t-distribution does not depend
upon $x_1, x_2,\ldots,x_n$. Now the joint pdf of $X_1, X_2,\ldots,X_n$ and $R\sqrt{n-2}/\sqrt{1 - R^2}$,
where $R$ is given by expression (9.7.1), is the product of $g(t)$ and the joint pdf of
$X_1,\ldots,X_n$. Integration on $x_1,\ldots,x_n$ yields the marginal pdf of $R\sqrt{n-2}/\sqrt{1 - R^2}$;
because $g(t)$ does not depend upon $x_1, x_2,\ldots,x_n$, it is obvious that this marginal pdf
is $g(t)$, the conditional pdf of $R\sqrt{n-2}/\sqrt{1 - R^2}$. The change-of-variable technique
can now be used to find the pdf of $R$.


Remark 9.7.1. Since $R$ has, when $\rho = 0$, a conditional distribution that does not
depend upon $x_1, x_2,\ldots,x_n$ (and hence that conditional distribution is, in fact, the
marginal distribution of $R$), we have the remarkable fact that $R$ is independent of
$X_1, X_2,\ldots,X_n$. It follows that $R$ is independent of every function of $X_1, X_2,\ldots,X_n$
alone, that is, a function that does not depend upon any $Y_i$. In like manner, $R$ is
independent of every function of $Y_1, Y_2,\ldots,Y_n$ alone. Moreover, a careful review of
the argument reveals that nowhere did we use the fact that $X$ has a normal marginal
distribution. Thus, if $X$ and $Y$ are independent, and if $Y$ has a normal distribution,
then $R$ has the same conditional distribution whatever the distribution of $X$, subject
to the condition $\sum_{i=1}^n (x_i - \bar{x})^2 > 0$. Moreover, if $P[\sum_{i=1}^n (X_i - \bar{X})^2 > 0] = 1$, then $R$
has the same marginal distribution whatever the distribution of $X$. ■


If we write $T = R\sqrt{n-2}/\sqrt{1 - R^2}$, where $T$ has a t-distribution with $n - 2 > 0$
degrees of freedom, it is easy to show by the change-of-variable technique (Exercise
9.7.4) that the pdf of $R$ is given by

$$h(r) = \begin{cases} \dfrac{\Gamma[(n-1)/2]}{\Gamma(1/2)\,\Gamma[(n-2)/2]}\,(1 - r^2)^{(n-4)/2} & -1 < r < 1 \\ 0 & \text{elsewhere.} \end{cases} \qquad (9.7.4)$$

We have now solved the problem of the distribution of $R$, when $\rho = 0$ and $n > 2$,
or perhaps more conveniently, that of $R\sqrt{n-2}/\sqrt{1 - R^2}$. The likelihood ratio test
of the hypothesis $H_0 : \rho = 0$ against all alternatives $H_1 : \rho \neq 0$ may be based either
on the statistic $R$ or on the statistic $R\sqrt{n-2}/\sqrt{1 - R^2} = T$, although the latter is
easier to use. Therefore, a level $\alpha$ test is to reject $H_0 : \rho = 0$ if $|T| \geq t_{\alpha/2,n-2}$.
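
The level α test just described takes only a few lines of R. The sketch below is illustrative rather than taken from the text: the paired vectors x and y are hypothetical, and α = 0.05 is an assumed choice. It computes T from the sample correlation and compares the result with the intrinsic function cor.test, which reports the same t statistic.

```r
# Hypothetical paired data, for illustration only
x <- c(2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 6.1, 2.7)
y <- c(1.8, 3.9, 2.5, 5.1, 3.6, 4.4, 5.8, 2.2)
n <- length(x)

r     <- cor(x, y)                          # sample correlation coefficient R
tstat <- r * sqrt(n - 2) / sqrt(1 - r^2)    # T = R sqrt(n-2) / sqrt(1-R^2)
crit  <- qt(1 - 0.05 / 2, df = n - 2)       # t_{alpha/2, n-2} with alpha = 0.05
pval  <- 2 * pt(-abs(tstat), df = n - 2)    # two-sided p-value

reject <- abs(tstat) >= crit                # TRUE means reject H0: rho = 0

# cor.test() reports the same statistic, degrees of freedom, and p-value
ct <- cor.test(x, y)
c(tstat = tstat, builtin = unname(ct$statistic))
```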


Remark 9.7.2. It is possible to obtain an approximate test of size $\alpha$ by using the
fact that

$$W = \frac{1}{2}\log\left(\frac{1 + R}{1 - R}\right)$$

has an approximate normal distribution with mean $\frac{1}{2}\log[(1 + \rho)/(1 - \rho)]$ and with
variance $1/(n - 3)$. We accept this statement without proof. Thus a test of $H_0 :
\rho = 0$ can be based on the statistic

$$Z = \frac{\frac{1}{2}\log[(1 + R)/(1 - R)] - \frac{1}{2}\log[(1 + \rho)/(1 - \rho)]}{\sqrt{1/(n - 3)}} \qquad (9.7.5)$$

with $\rho = 0$ so that $\frac{1}{2}\log[(1 + \rho)/(1 - \rho)] = 0$. However, using $W$, we can also test
a hypothesis like $H_0 : \rho = \rho_0$ against $H_1 : \rho \neq \rho_0$, where $\rho_0$ is not necessarily zero.
In that case, the hypothesized mean of $W$ is

$$\frac{1}{2}\log\left(\frac{1 + \rho_0}{1 - \rho_0}\right).$$

Furthermore, as outlined in Exercise 9.7.6, $Z$ can be used to obtain an asymptotic
confidence interval for $\rho$. ■
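
As a hedged illustration of this remark, the following R sketch computes W, the statistic Z of (9.7.5), and an approximate 95% interval for ρ obtained by inverting the transformation with tanh. The data vectors and the value rho0 are hypothetical assumptions, not values from the text. The interval is compared with the one returned by cor.test, which is based on the same transformation.

```r
# Hypothetical paired data, for illustration only
x <- c(2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 6.1, 2.7)
y <- c(1.8, 3.9, 2.5, 5.1, 3.6, 4.4, 5.8, 2.2)
n <- length(x)
r <- cor(x, y)

w    <- 0.5 * log((1 + r) / (1 - r))   # W; identical to atanh(r)
rho0 <- 0                               # hypothesized value of rho (assumed)
z    <- (w - 0.5 * log((1 + rho0) / (1 - rho0))) / sqrt(1 / (n - 3))
pval <- 2 * pnorm(-abs(z))              # approximate two-sided p-value

# Approximate 95% interval for rho: invert the transformation with tanh
zc <- qnorm(0.975)
ci <- tanh(w + c(-1, 1) * zc / sqrt(n - 3))

# Interval reported by cor.test(), stripped of its attributes
ci_builtin <- as.vector(cor.test(x, y)$conf.int)
rbind(ci, ci_builtin)
```

Changing rho0 to a nonzero value gives the test of $H_0 : \rho = \rho_0$ mentioned in the remark.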


EXERCISES 


9.7.1. Show that 


9.7.2. A random sample of size $n = 6$ from a bivariate normal distribution yields
a value of the correlation coefficient of 0.89. Would we accept or reject, at the 5%
significance level, the hypothesis that $\rho = 0$?


9.7.3. Verify Equation (9.7.3) of this section.

9.7.4. Verify the pdf (9.7.4) of this section.


9.7.5. Using the results of Section 4.5, show that $R$, (9.7.1), is a consistent estimate
of $\rho$.


9.7.6. By doing the following steps, determine a $(1 - \alpha)100\%$ approximate
confidence interval for $\rho$.

(a) For $0 < \alpha < 1$, in the usual way, start with $1 - \alpha = P(-z_{\alpha/2} < Z < z_{\alpha/2})$,
where $Z$ is given by expression (9.7.5). Then isolate $h(\rho) = (1/2)\log[(1 +
\rho)/(1 - \rho)]$ in the middle part of the inequality. Find $h'(\rho)$ and show that it is
strictly positive on $-1 < \rho < 1$; hence, $h$ is strictly increasing and its inverse
function exists.

(b) Show that this inverse function is the hyperbolic tangent function given by
$\tanh(y) = (e^{y} - e^{-y})/(e^{y} + e^{-y})$.

(c) Obtain a $(1 - \alpha)100\%$ confidence interval for $\rho$.


9.7.7. The intrinsic R function cor.test(x,y) computes the estimate of $\rho$ and the
confidence interval in Exercise 9.7.6. Recall the baseball data, which are in the file
bb.rda.

(a) Using the baseball data, determine the estimate and the confidence interval for
the correlation coefficient between height and weight for professional baseball
players.

(b) Separate the pitchers and hitters and for each obtain the estimate and
confidence interval for the correlation coefficient between height and weight. Do they differ
significantly?

(c) Argue that the difference in the estimates of the correlation coefficients is the
mle of $\rho_1 - \rho_2$ for two independent samples, as in Part (b).


9.7.8. Two experiments gave the following results:

  n    $\bar{x}$   $\bar{y}$   $s_x$   $s_y$    $r$
100    10    20     5     8    0.70
200    12    22     6    10    0.80

Calculate $r$ for the combined sample.


9.8 The Distributions of Certain Quadratic Forms 


Remark 9.8.1. It is essential that the reader have the background of the
multivariate normal distribution as given in Section 3.5 to understand Sections 9.8 and
9.9. ■


Remark 9.8.2. We make use of the trace of a square matrix. If $\mathbf{A} = [a_{ij}]$ is an
$n \times n$ matrix, then we define the trace of $\mathbf{A}$, ($\mathrm{tr}\,\mathbf{A}$), to be the sum of its diagonal
entries; i.e.,

$$\mathrm{tr}\,\mathbf{A} = \sum_{i=1}^n a_{ii}. \qquad (9.8.1)$$

The trace of a matrix has several interesting properties. One is that it is a linear
operator; that is,

$$\mathrm{tr}\,(a\mathbf{A} + b\mathbf{B}) = a\,\mathrm{tr}\,\mathbf{A} + b\,\mathrm{tr}\,\mathbf{B}. \qquad (9.8.2)$$

A second useful property is: If $\mathbf{A}$ is an $n \times m$ matrix, $\mathbf{B}$ is an $m \times k$ matrix, and $\mathbf{C}$
is a $k \times n$ matrix, then

$$\mathrm{tr}\,(\mathbf{ABC}) = \mathrm{tr}\,(\mathbf{BCA}) = \mathrm{tr}\,(\mathbf{CAB}). \qquad (9.8.3)$$

The reader is asked to prove these facts in Exercise 9.8.7. Finally, a simple but
useful property is that $\mathrm{tr}\,a = a$, for any scalar $a$. ■
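
These trace properties are easy to check numerically. The sketch below is only an illustration: base R has no built-in trace function, so tr is a hypothetical one-line helper, and the matrices are arbitrary simulated values with compatible dimensions.

```r
tr <- function(M) sum(diag(M))   # hypothetical helper: sum of diagonal entries

set.seed(1)
A <- matrix(rnorm(6),  nrow = 2, ncol = 3)   # 2 x 3
B <- matrix(rnorm(12), nrow = 3, ncol = 4)   # 3 x 4
C <- matrix(rnorm(8),  nrow = 4, ncol = 2)   # 4 x 2

# Cyclic property (9.8.3): the three traces agree up to rounding error
c(tr(A %*% B %*% C), tr(B %*% C %*% A), tr(C %*% A %*% B))

# Linearity (9.8.2) for two square matrices of the same order
P <- matrix(1:9, 3, 3)
D <- diag(3)
tr(2 * P + 5 * D) == 2 * tr(P) + 5 * tr(D)
```

Note that ABC is 2 × 2, BCA is 3 × 3, and CAB is 4 × 4, so the cyclic property equates traces of matrices of different orders.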


We begin this section with a more formal but equivalent definition of a quadratic
form. Let $\mathbf{X} = (X_1,\ldots,X_n)$ be an $n$-dimensional random vector and let $\mathbf{A}$ be a
real $n \times n$ symmetric matrix. Then the random variable $Q = \mathbf{X}'\mathbf{A}\mathbf{X}$ is called a
quadratic form in $\mathbf{X}$. Due to the symmetry of $\mathbf{A}$, there are several ways we can
write $Q$:

$$Q = \mathbf{X}'\mathbf{A}\mathbf{X} = \sum_{i=1}^n \sum_{j=1}^n a_{ij}X_iX_j = \sum_{i=1}^n a_{ii}X_i^2 + \mathop{\sum\sum}_{i \neq j} a_{ij}X_iX_j \qquad (9.8.4)$$

$$= \sum_{i=1}^n a_{ii}X_i^2 + 2\mathop{\sum\sum}_{i < j} a_{ij}X_iX_j. \qquad (9.8.5)$$

These are very useful random variables in analysis of variance models. As the
following theorem shows, the mean of a quadratic form is easily obtained.


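Theorem 9.8.1. Suppose the $n$-dimensional random vector $\mathbf{X}$ has mean $\boldsymbol{\mu}$ and
variance-covariance matrix $\boldsymbol{\Sigma}$. Let $Q = \mathbf{X}'\mathbf{A}\mathbf{X}$, where $\mathbf{A}$ is a real $n \times n$ symmetric
matrix. Then

$$E(Q) = \mathrm{tr}\,\mathbf{A}\boldsymbol{\Sigma} + \boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}. \qquad (9.8.6)$$

Proof: Using the trace operator and property (9.8.3), we have

$$\begin{aligned}
E(Q) &= E(\mathrm{tr}\,\mathbf{X}'\mathbf{A}\mathbf{X}) = E(\mathrm{tr}\,\mathbf{A}\mathbf{X}\mathbf{X}') \\
&= \mathrm{tr}\,\mathbf{A}E(\mathbf{X}\mathbf{X}') \\
&= \mathrm{tr}\,\mathbf{A}(\boldsymbol{\Sigma} + \boldsymbol{\mu}\boldsymbol{\mu}') \\
&= \mathrm{tr}\,\mathbf{A}\boldsymbol{\Sigma} + \boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu},
\end{aligned}$$

where the third line follows from Theorem 2.6.3. ■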
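
Theorem 9.8.1 requires only a mean vector and a covariance matrix, so it also holds for the empirical distribution that puts mass 1/N on each of N points. The R sketch below uses that observation to check (9.8.6) exactly, with no Monte Carlo error. The simulated points, the matrix A, and the helper tr are hypothetical choices made for illustration.

```r
tr <- function(M) sum(diag(M))   # hypothetical helper: trace of a square matrix

set.seed(2)
N <- 500; n <- 3
Xs <- matrix(rnorm(N * n, mean = 2), nrow = N)    # rows are the N support points
A  <- matrix(c(2, 1, 0,
               1, 3, 1,
               0, 1, 1), nrow = 3, byrow = TRUE)  # a symmetric matrix

# Mean and covariance of the empirical distribution (divisor N, not N - 1)
mu    <- colMeans(Xs)
Sigma <- crossprod(sweep(Xs, 2, mu)) / N

# Left side of (9.8.6): average of Q = x'Ax over the support points
EQ  <- mean(rowSums((Xs %*% A) * Xs))
# Right side of (9.8.6)
rhs <- tr(A %*% Sigma) + as.numeric(t(mu) %*% A %*% mu)
all.equal(EQ, rhs)
```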


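Example 9.8.1 (Sample Variance). Let $\mathbf{X}' = (X_1,\ldots,X_n)$ be an $n$-dimensional
vector of random variables. Let $\mathbf{1}' = (1,\ldots,1)$ be the $n$-dimensional vector whose
components are 1. Let $\mathbf{I}$ be the $n \times n$ identity matrix. Consider the quadratic form
$Q = \mathbf{X}'(\mathbf{I} - \frac{1}{n}\mathbf{J})\mathbf{X}$, where $\mathbf{J} = \mathbf{1}\mathbf{1}'$; i.e., $\mathbf{J}$ is an $n \times n$ matrix with all entries equal
to 1. Note that the off-diagonal entries of $(\mathbf{I} - \frac{1}{n}\mathbf{J})$ are $-n^{-1}$ while the diagonal
entries are $1 - n^{-1}$; hence, by (9.8.4), $Q$ simplifies to

$$\begin{aligned}
Q &= \sum_{i=1}^n X_i^2\left(1 - \frac{1}{n}\right) + \mathop{\sum\sum}_{i \neq j}\left(-\frac{1}{n}\right)X_iX_j \\
&= \sum_{i=1}^n X_i^2 - \frac{1}{n}\sum_{i=1}^n\sum_{j=1}^n X_iX_j \\
&= \sum_{i=1}^n X_i^2 - n\bar{X}^2 = (n-1)S^2, \qquad (9.8.7)
\end{aligned}$$

where $\bar{X}$ and $S^2$ denote the sample mean and variance of $X_1,\ldots,X_n$.

Suppose we further assume that $X_1,\ldots,X_n$ are iid random variables with common
mean $\mu$ and variance $\sigma^2$. Using Theorem 9.8.1, we can obtain yet another
proof that $S^2$ is an unbiased estimate of $\sigma^2$. Note that the mean of the random
vector $\mathbf{X}$ is $\mu\mathbf{1}$ and that its variance-covariance matrix is $\sigma^2\mathbf{I}$. Based on Theorem
9.8.1, we find immediately that

$$E(S^2) = \frac{1}{n-1}\left\{\mathrm{tr}\left(\mathbf{I} - \frac{1}{n}\mathbf{J}\right)\sigma^2 + \mu^2\left(\mathbf{1}'\mathbf{1} - \frac{1}{n}\mathbf{1}'\mathbf{1}\mathbf{1}'\mathbf{1}\right)\right\} = \sigma^2. \ ■$$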
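
Identity (9.8.7) can be confirmed directly. In the minimal R sketch below, the sample x is a hypothetical set of values; the quadratic form with matrix I − (1/n)J is computed by matrix multiplication and compared with (n − 1) times the value returned by var.

```r
x <- c(4.2, 5.1, 3.8, 6.0, 5.5)        # hypothetical sample
n <- length(x)
J <- matrix(1, n, n)                    # J = 11'
M <- diag(n) - J / n                    # I - (1/n) J

Q <- as.numeric(t(x) %*% M %*% x)       # quadratic form x'(I - J/n)x
c(Q = Q, check = (n - 1) * var(x))      # the two values agree
```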


The spectral decomposition of symmetric matrices proves quite useful in this part
of the chapter. As discussed around expression (3.5.8), a real symmetric matrix $\mathbf{A}$
can be diagonalized as

$$\mathbf{A} = \boldsymbol{\Gamma}'\boldsymbol{\Lambda}\boldsymbol{\Gamma}, \qquad (9.8.8)$$

where $\boldsymbol{\Lambda}$ is the diagonal matrix $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1,\ldots,\lambda_n)$, $\lambda_1 \geq \cdots \geq \lambda_n$ are the
eigenvalues of $\mathbf{A}$, and the columns of $\boldsymbol{\Gamma}' = [\mathbf{v}_1 \cdots \mathbf{v}_n]$ are the corresponding orthonormal
eigenvectors (i.e., $\boldsymbol{\Gamma}$ is an orthogonal matrix). Recall from linear algebra that the
rank of $\mathbf{A}$ is the number of nonzero eigenvalues. Further, because $\boldsymbol{\Lambda}$ is diagonal, we
can write this expression as

$$\mathbf{A} = \sum_{i=1}^n \lambda_i \mathbf{v}_i\mathbf{v}_i'. \qquad (9.8.9)$$

The R command to compute the spectral decomposition of $\mathbf{A}$ is sdc=eigen(amat),
where amat is the R matrix for $\mathbf{A}$. The eigenvalues and eigenvectors are in the
respective attributes sdc$values and sdc$vectors. For normal random variables,
we make use of equation (9.8.9) to obtain the mgf of the quadratic form $Q$ in the
next theorem, Theorem 9.8.2.
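
A short sketch of the eigen command in use follows. The matrix amat below is a hypothetical symmetric matrix chosen for illustration. The code rebuilds the matrix from the sum in (9.8.9) and counts the nonzero eigenvalues to obtain the rank; the tolerance 1e-10 for "nonzero" is an assumed numerical cutoff.

```r
amat <- matrix(c(2, 1, 0,
                 1, 2, 1,
                 0, 1, 2), nrow = 3, byrow = TRUE)   # hypothetical symmetric A
sdc <- eigen(amat, symmetric = TRUE)
lambda <- sdc$values     # eigenvalues, in decreasing order
V      <- sdc$vectors    # columns are orthonormal eigenvectors v_i

# Rebuild A as the sum of lambda_i v_i v_i', expression (9.8.9)
Arebuilt <- matrix(0, 3, 3)
for (i in 1:3) Arebuilt <- Arebuilt + lambda[i] * outer(V[, i], V[, i])

max(abs(Arebuilt - amat))        # numerically zero
max(abs(crossprod(V) - diag(3))) # V'V = I, so the eigenvectors are orthonormal
sum(abs(lambda) > 1e-10)         # rank of A = number of nonzero eigenvalues
```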


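Theorem 9.8.2. Let $\mathbf{X}' = (X_1,\ldots,X_n)$, where $X_1,\ldots,X_n$ are iid $N(0,\sigma^2)$.
Consider the quadratic form $Q = \sigma^{-2}\mathbf{X}'\mathbf{A}\mathbf{X}$ for a symmetric matrix $\mathbf{A}$ of rank $r \leq n$.
Then $Q$ has the moment generating function

$$M(t) = \prod_{i=1}^r (1 - 2t\lambda_i)^{-1/2} = |\mathbf{I} - 2t\mathbf{A}|^{-1/2}, \qquad (9.8.10)$$

where $\lambda_1,\ldots,\lambda_r$ are the nonzero eigenvalues of $\mathbf{A}$, $|t| < 1/(2\lambda^*)$, and the value of
$\lambda^*$ is given by $\lambda^* = \max_{1 \leq i \leq r} |\lambda_i|$.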


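Proof: Write the spectral decomposition of $\mathbf{A}$ as in expression (9.8.9). Since the
rank of $\mathbf{A}$ is $r$, exactly $r$ of the eigenvalues are not 0. Denote the nonzero eigenvalues
by $\lambda_1,\ldots,\lambda_r$. Then we can write $Q$ as

$$Q = \sum_{i=1}^r \lambda_i(\sigma^{-1}\mathbf{v}_i'\mathbf{X})^2. \qquad (9.8.11)$$

Let $\boldsymbol{\Gamma}_1' = [\mathbf{v}_1 \cdots \mathbf{v}_r]$ and define the $r$-dimensional random vector $\mathbf{W}$ by $\mathbf{W} =
\sigma^{-1}\boldsymbol{\Gamma}_1\mathbf{X}$. Since $\mathbf{X}$ is $N_n(\mathbf{0}, \sigma^2\mathbf{I}_n)$ and $\boldsymbol{\Gamma}_1\boldsymbol{\Gamma}_1' = \mathbf{I}_r$, Theorem 3.5.2 shows that $\mathbf{W}$
has a $N_r(\mathbf{0}, \mathbf{I}_r)$ distribution. In terms of the $W_i$, we can write (9.8.11) as

$$Q = \sum_{i=1}^r \lambda_i W_i^2. \qquad (9.8.12)$$

Because $W_1,\ldots,W_r$ are independent $N(0,1)$ random variables, $W_1^2,\ldots,W_r^2$ are
independent $\chi^2(1)$ random variables. Thus the mgf of $Q$ is

$$E[\exp\{tQ\}] = E\left[\exp\left\{\sum_{i=1}^r t\lambda_i W_i^2\right\}\right] = \prod_{i=1}^r E[\exp\{t\lambda_i W_i^2\}] = \prod_{i=1}^r (1 - 2t\lambda_i)^{-1/2}. \qquad (9.8.13)$$

The last equality holds if we assume that $|t| < 1/(2\lambda^*)$, where $\lambda^* = \max_{1 \leq i \leq r} |\lambda_i|$;
see Exercise 9.8.6. To obtain the second form in (9.8.10), recall that the determinant
of an orthogonal matrix is $\pm 1$, so that $|\boldsymbol{\Gamma}'||\boldsymbol{\Gamma}| = 1$. The result then follows from

$$|\mathbf{I} - 2t\mathbf{A}| = |\boldsymbol{\Gamma}'\boldsymbol{\Gamma} - 2t\boldsymbol{\Gamma}'\boldsymbol{\Lambda}\boldsymbol{\Gamma}| = |\boldsymbol{\Gamma}'(\mathbf{I} - 2t\boldsymbol{\Lambda})\boldsymbol{\Gamma}| = |\mathbf{I} - 2t\boldsymbol{\Lambda}| = \left\{\prod_{i=1}^r (1 - 2t\lambda_i)^{-1/2}\right\}^{-2}. \ ■$$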
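
The equality of the two forms in (9.8.10) can be checked numerically at a fixed value of t. The following R sketch is illustrative only: the matrix A (symmetric, of rank 2) and the point t0 = 0.1 are hypothetical choices, and 1e-10 is an assumed cutoff for deciding that an eigenvalue is zero.

```r
A <- matrix(c(1, 1, 0,
              1, 1, 0,
              0, 0, 3), nrow = 3, byrow = TRUE)   # hypothetical symmetric A of rank 2
ev <- eigen(A, symmetric = TRUE)$values
lambda <- ev[abs(ev) > 1e-10]          # nonzero eigenvalues (here 3 and 2)

t0 <- 0.1                               # must satisfy |t| < 1/(2 * max|lambda_i|)
stopifnot(abs(t0) < 1 / (2 * max(abs(lambda))))

m_prod <- prod((1 - 2 * t0 * lambda)^(-1/2))   # product form of (9.8.10)
m_det  <- det(diag(3) - 2 * t0 * A)^(-1/2)     # determinant form of (9.8.10)
c(m_prod, m_det)
```

The zero eigenvalue contributes a factor of 1 to the determinant, which is why only the nonzero eigenvalues appear in the product.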
Example 9.8.2. To illustrate this theorem, suppose $X_i$, $i = 1, 2,\ldots,n$, are
independent random variables with $X_i$ distributed as $N(\mu_i, \sigma_i^2)$, $i = 1, 2,\ldots,n$,
respectively. Let $Z_i = (X_i - \mu_i)/\sigma_i$. We know that $\sum_{i=1}^n Z_i^2$ has a $\chi^2$ distribution
with $n$ degrees of freedom. To illustrate Theorem 9.8.2, let $\mathbf{Z}' = (Z_1,\ldots,Z_n)$. Let
$Q = \mathbf{Z}'\mathbf{I}\mathbf{Z}$. Hence the symmetric matrix associated with $Q$ is the identity matrix $\mathbf{I}$,
which has $n$ eigenvalues, all of value 1; i.e., $\lambda_i \equiv 1$. By Theorem 9.8.2, the mgf of
$Q$ is $(1 - 2t)^{-n/2}$; i.e., $Q$ is distributed $\chi^2$ with $n$ degrees of freedom. ■


In general, from Theorem 9.8.2, note how close the mgf of the quadratic form
$Q$ is to the mgf of a $\chi^2$ distribution. The next two theorems give conditions where
this is true.


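Theorem 9.8.3. Let $\mathbf{X}' = (X_1, X_2,\ldots,X_n)$ have a $N_n(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ distribution, where
$\boldsymbol{\Sigma}$ is positive definite. Then $Q = (\mathbf{X} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \boldsymbol{\mu})$ has a $\chi^2(n)$ distribution.


Proof: Write the spectral decomposition of $\boldsymbol{\Sigma}$ as $\boldsymbol{\Sigma} = \boldsymbol{\Gamma}'\boldsymbol{\Lambda}\boldsymbol{\Gamma}$, where $\boldsymbol{\Gamma}$ is an orthogonal
matrix and $\boldsymbol{\Lambda} = \mathrm{diag}\{\lambda_1,\ldots,\lambda_n\}$ is a diagonal matrix whose diagonal entries
are the eigenvalues of $\boldsymbol{\Sigma}$. Because $\boldsymbol{\Sigma}$ is positive definite, all $\lambda_i > 0$. Hence we can
write

$$\boldsymbol{\Sigma}^{-1} = \boldsymbol{\Gamma}'\boldsymbol{\Lambda}^{-1}\boldsymbol{\Gamma} = \boldsymbol{\Gamma}'\boldsymbol{\Lambda}^{-1/2}\boldsymbol{\Gamma}\boldsymbol{\Gamma}'\boldsymbol{\Lambda}^{-1/2}\boldsymbol{\Gamma},$$

where $\boldsymbol{\Lambda}^{-1/2} = \mathrm{diag}\{\lambda_1^{-1/2},\ldots,\lambda_n^{-1/2}\}$. Thus we have

$$Q = \left\{\boldsymbol{\Lambda}^{-1/2}\boldsymbol{\Gamma}(\mathbf{X} - \boldsymbol{\mu})\right\}'\,\mathbf{I}\,\left\{\boldsymbol{\Lambda}^{-1/2}\boldsymbol{\Gamma}(\mathbf{X} - \boldsymbol{\mu})\right\}.$$

But by Theorem 3.5.2, it is easy to show that the random vector $\boldsymbol{\Lambda}^{-1/2}\boldsymbol{\Gamma}(\mathbf{X} - \boldsymbol{\mu})$
has a $N_n(\mathbf{0}, \mathbf{I})$ distribution; hence, $Q$ has a $\chi^2(n)$ distribution. ■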
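
The factorization used in this proof can be traced in R. The sketch below is an illustration under assumed inputs: Sigma, mu, and the observation x are hypothetical. It standardizes x − μ through the spectral decomposition and confirms that the resulting sum of squares equals the quadratic form, which the base R function mahalanobis computes directly.

```r
Sigma <- matrix(c(4, 1,
                  1, 2), nrow = 2, byrow = TRUE)   # hypothetical positive definite Sigma
mu <- c(1, -1)                                     # hypothetical mean vector
x  <- c(2.5, 0.3)                                  # one hypothetical observation

e <- eigen(Sigma, symmetric = TRUE)
G <- t(e$vectors)                  # rows of G are eigenvectors: Sigma = G' Lambda G
Lam_inv_half <- diag(1 / sqrt(e$values))

z <- Lam_inv_half %*% G %*% (x - mu)   # Lambda^{-1/2} Gamma (x - mu)
Q <- sum(z^2)                          # Q = z'z

# Same value from the direct definition (x - mu)' Sigma^{-1} (x - mu)
Q_direct <- mahalanobis(x, center = mu, cov = Sigma)
c(Q, Q_direct)
```

By the theorem, a value of Q observed from a bivariate normal vector can be referred to the χ²(2) distribution, for instance through pchisq(Q, df = 2).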


The remarkable fact that the random variable $Q$ in the last theorem is $\chi^2(n)$
stimulates a number of questions about quadratic forms in normally distributed
variables. We would like to treat this problem generally, but limitations of space
forbid this, and we find it necessary to restrict ourselves to some special cases; see,
for instance, Stapleton (2009) for discussion.

Recall from linear algebra that a symmetric matrix $\mathbf{A}$ is idempotent if $\mathbf{A}^2 = \mathbf{A}$.
In Section 9.1, we have already met some idempotent matrices. For example, the
matrix $\mathbf{I} - \frac{1}{n}\mathbf{J}$ of Example 9.8.1 is idempotent. Idempotent matrices possess some
important characteristics. Suppose $\lambda$ is an eigenvalue of an idempotent matrix $\mathbf{A}$
with corresponding eigenvector $\mathbf{v}$. Then the following identity is true:

$$\lambda\mathbf{v} = \mathbf{A}\mathbf{v} = \mathbf{A}^2\mathbf{v} = \lambda\mathbf{A}\mathbf{v} = \lambda^2\mathbf{v}.$$

Hence $\lambda(\lambda - 1)\mathbf{v} = \mathbf{0}$. Since $\mathbf{v} \neq \mathbf{0}$, $\lambda = 0$ or 1. Conversely, if the eigenvalues
of a real symmetric matrix are only 0s and 1s then it is idempotent; see Exercise
9.8.10. Thus the rank of an idempotent matrix $\mathbf{A}$ is the number of its eigenvalues
which are 1. Denote the spectral decomposition of $\mathbf{A}$ by $\mathbf{A} = \boldsymbol{\Gamma}'\boldsymbol{\Lambda}\boldsymbol{\Gamma}$, where $\boldsymbol{\Lambda}$ is a
diagonal matrix of eigenvalues and $\boldsymbol{\Gamma}$ is an orthogonal matrix whose columns are
the corresponding orthonormal eigenvectors. Because the diagonal entries of $\boldsymbol{\Lambda}$ are
0 or 1 and $\boldsymbol{\Gamma}$ is orthogonal, we have

$$\mathrm{tr}\,\mathbf{A} = \mathrm{tr}\,\boldsymbol{\Lambda}\boldsymbol{\Gamma}\boldsymbol{\Gamma}' = \mathrm{tr}\,\boldsymbol{\Lambda} = \mathrm{rank}(\mathbf{A});$$

i.e., the rank of an idempotent matrix is equal to its trace.
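
These facts about idempotent matrices can be seen on the matrix I − (1/n)J of Example 9.8.1. The R sketch below is a small illustration with the assumed value n = 5; it checks idempotency, lists the eigenvalues, and compares the trace with a numerically computed rank.

```r
n <- 5                                     # assumed order, for illustration
M <- diag(n) - matrix(1, n, n) / n         # I - (1/n) J

max(abs(M %*% M - M))                      # numerically zero: M is idempotent
ev <- eigen(M, symmetric = TRUE)$values
round(ev, 10)                              # eigenvalue 1 appears n - 1 times, then 0
sum(diag(M))                               # trace equals n - 1
qr(M)$rank                                 # numerical rank, also n - 1
```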


Theorem 9.8.4. Let $\mathbf{X}' = (X_1,\ldots,X_n)$, where $X_1,\ldots,X_n$ are iid $N(0,\sigma^2)$. Let
$Q = \sigma^{-2}\mathbf{X}'\mathbf{A}\mathbf{X}$ for a symmetric matrix $\mathbf{A}$ with rank $r$. Then $Q$ has a $\chi^2(r)$
distribution if and only if $\mathbf{A}$ is idempotent.


Proof: By Theorem 9.8.2, the mgf of $Q$ is

$$M_Q(t) = \prod_{i=1}^r (1 - 2t\lambda_i)^{-1/2}, \qquad (9.8.14)$$

where $\lambda_1,\ldots,\lambda_r$ are the $r$ nonzero eigenvalues of $\mathbf{A}$. Suppose, first, that $\mathbf{A}$ is
idempotent. Then $\lambda_1 = \cdots = \lambda_r = 1$ and the mgf of $Q$ is $M_Q(t) = (1 - 2t)^{-r/2}$;
i.e., $Q$ has a $\chi^2(r)$ distribution. Next, suppose $Q$ has a $\chi^2(r)$ distribution. Then
for $t$ in a neighborhood of 0, we have the identity

$$\prod_{i=1}^r (1 - 2t\lambda_i)^{-1/2} = (1 - 2t)^{-r/2},$$

which, upon squaring both sides and taking reciprocals, leads to

$$\prod_{i=1}^r (1 - 2t\lambda_i) = (1 - 2t)^r.$$

By the uniqueness of the factorization of polynomials, $\lambda_1 = \cdots = \lambda_r = 1$. Hence $\mathbf{A}$
is idempotent. ■
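
Example 9.8.3. Based on this last theorem, we can obtain quickly the
distribution of the sample variance when sampling from a normal distribution.
Suppose $X_1, X_2,\ldots,X_n$ are iid $N(\mu,\sigma^2)$. Let $\mathbf{X} = (X_1, X_2,\ldots,X_n)'$. Then $\mathbf{X}$ has a
$N_n(\mu\mathbf{1}, \sigma^2\mathbf{I})$ distribution, where $\mathbf{1}$ denotes an $n \times 1$ vector with all components equal
to 1. Let $S^2 = (n-1)^{-1}\sum_{i=1}^n (X_i - \bar{X})^2$. Then by Example 9.8.1, we can write

$$\frac{(n-1)S^2}{\sigma^2} = \sigma^{-2}\mathbf{X}'\left(\mathbf{I} - \frac{1}{n}\mathbf{J}\right)\mathbf{X} = \sigma^{-2}(\mathbf{X} - \mu\mathbf{1})'\left(\mathbf{I} - \frac{1}{n}\mathbf{J}\right)(\mathbf{X} - \mu\mathbf{1}),$$

where the last equality holds because $(\mathbf{I} - \frac{1}{n}\mathbf{J})\mathbf{1} = \mathbf{0}$. Because the matrix $\mathbf{I} - \frac{1}{n}\mathbf{J}$ is
idempotent, $\mathrm{tr}\,(\mathbf{I} - \frac{1}{n}\mathbf{J}) = n - 1$, and $\mathbf{X} - \mu\mathbf{1}$ is $N_n(\mathbf{0}, \sigma^2\mathbf{I})$, it follows from Theorem
9.8.4 that $(n-1)S^2/\sigma^2$ has a $\chi^2(n-1)$ distribution. ■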


Remark 9.8.3. If the normal distribution in Theorem 9.8.4 is $N_n(\boldsymbol{\mu}, \sigma^2\mathbf{I})$, the
condition $\mathbf{A}^2 = \mathbf{A}$ remains a necessary and sufficient condition that $Q/\sigma^2$ have a
chi-square distribution. In general, however, $Q/\sigma^2$ is not central $\chi^2(r)$ but instead,
$Q/\sigma^2$ has a noncentral chi-square distribution if $\mathbf{A}^2 = \mathbf{A}$. The number of degrees
of freedom is $r$, the rank of $\mathbf{A}$, and the noncentrality parameter is $\boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}/\sigma^2$. If
$\boldsymbol{\mu} = \mu\mathbf{1}$, then $\boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu} = \mu^2\sum_{i,j} a_{ij}$, where $\mathbf{A} = [a_{ij}]$. Then, if $\mu \neq 0$, the
conditions $\mathbf{A}^2 = \mathbf{A}$ and $\sum_{i,j} a_{ij} = 0$ are necessary and sufficient conditions that $Q/\sigma^2$
be central $\chi^2(r)$. Moreover, the theorem may be extended to a quadratic form in
random variables which have a multivariate normal distribution with positive
definite covariance matrix $\boldsymbol{\Sigma}$; here the necessary and sufficient condition that $Q$ have
a chi-square distribution is $\mathbf{A}\boldsymbol{\Sigma}\mathbf{A} = \mathbf{A}$. See Exercise 9.8.9. ■


EXERCISES 


9.8.1. Let $Q = X_1X_2 - X_3X_4$, where $X_1, X_2, X_3, X_4$ is a random sample of size 4
from a distribution that is $N(0,\sigma^2)$. Show that $Q/\sigma^2$ does not have a chi-square
distribution. Find the mgf of $Q/\sigma^2$.


9.8.2. Let $\mathbf{X}' = [X_1, X_2]$ be bivariate normal with matrix of means $\boldsymbol{\mu}' = [\mu_1, \mu_2]$
and positive definite covariance matrix $\boldsymbol{\Sigma}$. Let

$$Q_1 = \frac{X_1^2}{\sigma_1^2(1 - \rho^2)} - 2\rho\frac{X_1X_2}{\sigma_1\sigma_2(1 - \rho^2)} + \frac{X_2^2}{\sigma_2^2(1 - \rho^2)}.$$

Show that $Q_1$ is $\chi^2(r,\theta)$ and find $r$ and $\theta$. When and only when does $Q_1$ have a
central chi-square distribution?


9.8.3. Let $\mathbf{X}' = [X_1, X_2, X_3]$ denote a random sample of size 3 from a distribution
that is $N(4,8)$ and let

$$\mathbf{A} = \begin{pmatrix} \frac{1}{2} & 0 & \frac{1}{2} \\ 0 & 1 & 0 \\ \frac{1}{2} & 0 & \frac{1}{2} \end{pmatrix}.$$

Let $Q = \mathbf{X}'\mathbf{A}\mathbf{X}/\sigma^2$.

(a) Use Theorem 9.8.1 to find $E(Q)$.

(b) Justify the assertion that $Q$ is $\chi^2(2,6)$.


9.8.4. Suppose $X_1,\ldots,X_n$ are independent random variables with the common
mean $\mu$ but with unequal variances $\sigma_i^2 = \mathrm{Var}(X_i)$.

(a) Determine the variance of $\bar{X}$.

(b) Determine the constant $K$ so that $Q = K\sum_{i=1}^n (X_i - \bar{X})^2$ is an unbiased
estimate of the variance of $\bar{X}$. (Hint: Proceed as in Example 9.8.3.)


9.8.5. Suppose $X_1,\ldots,X_n$ are correlated random variables, with common mean $\mu$
and variance $\sigma^2$ but with correlations $\rho$ (all correlations are the same).

(a) Determine the variance of $\bar{X}$.

(b) Determine the constant $K$ so that $Q = K\sum_{i=1}^n (X_i - \bar{X})^2$ is an unbiased
estimate of the variance of $\bar{X}$. (Hint: Proceed as in Example 9.8.3.)


9.8.6. Fill in the details for expression (9.8.13). 


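9.8.7. For the trace operator defined in expression (9.8.1), prove the following
properties are true.

(a) If $\mathbf{A}$ and $\mathbf{B}$ are $n \times n$ matrices and $a$ and $b$ are scalars, then

$$\mathrm{tr}\,(a\mathbf{A} + b\mathbf{B}) = a\,\mathrm{tr}\,\mathbf{A} + b\,\mathrm{tr}\,\mathbf{B}.$$

(b) If $\mathbf{A}$ is an $n \times m$ matrix, $\mathbf{B}$ is an $m \times k$ matrix, and $\mathbf{C}$ is a $k \times n$ matrix, then
$\mathrm{tr}\,(\mathbf{ABC}) = \mathrm{tr}\,(\mathbf{BCA}) = \mathrm{tr}\,(\mathbf{CAB})$.

(c) If $\mathbf{A}$ is a square matrix and $\boldsymbol{\Gamma}$ is an orthogonal matrix, use the result of part
(b) to show that $\mathrm{tr}\,(\boldsymbol{\Gamma}'\mathbf{A}\boldsymbol{\Gamma}) = \mathrm{tr}\,\mathbf{A}$.

(d) If $\mathbf{A}$ is a real symmetric idempotent matrix, use the result of part (c) to prove
that the rank of $\mathbf{A}$ is equal to $\mathrm{tr}\,\mathbf{A}$.


9.8.8. Let $\mathbf{A} = [a_{ij}]$ be a real symmetric matrix. Prove that $\sum_i \sum_j a_{ij}^2$ is equal to
the sum of the squares of the eigenvalues of $\mathbf{A}$.

Hint: If $\boldsymbol{\Gamma}$ is an orthogonal matrix, show that $\sum_i \sum_j a_{ij}^2 = \mathrm{tr}\,(\mathbf{A}^2) = \mathrm{tr}\,(\boldsymbol{\Gamma}'\mathbf{A}^2\boldsymbol{\Gamma}) =
\mathrm{tr}\,[(\boldsymbol{\Gamma}'\mathbf{A}\boldsymbol{\Gamma})(\boldsymbol{\Gamma}'\mathbf{A}\boldsymbol{\Gamma})]$.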


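9.8.9. Suppose $\mathbf{X}$ has a $N_n(\mathbf{0}, \boldsymbol{\Sigma})$ distribution, where $\boldsymbol{\Sigma}$ is positive definite. Let
$Q = \mathbf{X}'\mathbf{A}\mathbf{X}$ for a symmetric matrix $\mathbf{A}$ with rank $r$. Prove $Q$ has a $\chi^2(r)$ distribution
if and only if $\mathbf{A}\boldsymbol{\Sigma}\mathbf{A} = \mathbf{A}$.

Hint: Write $Q$ as

$$Q = (\boldsymbol{\Sigma}^{-1/2}\mathbf{X})'\,\boldsymbol{\Sigma}^{1/2}\mathbf{A}\boldsymbol{\Sigma}^{1/2}\,(\boldsymbol{\Sigma}^{-1/2}\mathbf{X}),$$

where $\boldsymbol{\Sigma}^{1/2} = \boldsymbol{\Gamma}'\boldsymbol{\Lambda}^{1/2}\boldsymbol{\Gamma}$ and $\boldsymbol{\Sigma} = \boldsymbol{\Gamma}'\boldsymbol{\Lambda}\boldsymbol{\Gamma}$ is the spectral decomposition of $\boldsymbol{\Sigma}$. Then
use Theorem 9.8.4.


9.8.10. Suppose $\mathbf{A}$ is a real symmetric matrix. If the eigenvalues of $\mathbf{A}$ are only 0s
and 1s then prove that $\mathbf{A}$ is idempotent.


9.9 The Independence of Certain Quadratic Forms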


We have previously investigated the independence of linear functions of normally
distributed variables. In this section we shall prove some theorems about the
independence of quadratic forms. We shall confine our attention to normally
distributed variables that constitute a random sample of size $n$ from a distribution
that is $N(0,\sigma^2)$.


Remark 9.9.1. In the proof of the next theorem, we use the fact that if $\mathbf{A}$ is an
$m \times n$ matrix of rank $n$ (i.e., $\mathbf{A}$ has full column rank), then the matrix $\mathbf{A}'\mathbf{A}$ is
nonsingular. A proof of this linear algebra fact is sketched in Exercises 9.9.12 and
9.9.13. ■


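Theorem 9.9.1 (Craig). Let $\mathbf{X}' = (X_1,\ldots,X_n)$, where $X_1,\ldots,X_n$ are iid $N(0,\sigma^2)$
random variables. For real symmetric matrices $\mathbf{A}$ and $\mathbf{B}$, let $Q_1 = \sigma^{-2}\mathbf{X}'\mathbf{A}\mathbf{X}$ and
$Q_2 = \sigma^{-2}\mathbf{X}'\mathbf{B}\mathbf{X}$ denote quadratic forms in $\mathbf{X}$. The random variables $Q_1$ and $Q_2$
are independent if and only if $\mathbf{AB} = \mathbf{0}$.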
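
A familiar instance of the condition AB = 0 is the pair A = (1/n)J and B = I − (1/n)J, for which Q1 is proportional to the squared sample mean and Q2 to the sample variance. The R sketch below illustrates this; the values n = 6 and sigma = 2 and the simulated sample are assumptions made only for the example.

```r
n <- 6                   # assumed sample size
J <- matrix(1, n, n)
A <- J / n               # Q1 = x'Ax / sigma^2 = n * xbar^2 / sigma^2
B <- diag(n) - J / n     # Q2 = x'Bx / sigma^2 = (n - 1) * s^2 / sigma^2

max(abs(A %*% B))        # AB = 0, so Q1 and Q2 are independent by the theorem

# Evaluate the two quadratic forms on one simulated N(0, sigma^2) sample
set.seed(3)
sigma <- 2
x  <- rnorm(n, mean = 0, sd = sigma)
Q1 <- as.numeric(t(x) %*% A %*% x) / sigma^2
Q2 <- as.numeric(t(x) %*% B %*% x) / sigma^2
c(Q1 = Q1, nxbar2 = n * mean(x)^2 / sigma^2)
c(Q2 = Q2, scaledS2 = (n - 1) * var(x) / sigma^2)
```

This recovers, for a mean-zero normal sample, the independence of the sample mean and the sample variance.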


Proof: First, we obtain some preliminary results. Based on these results, the proof
follows immediately. Assume the ranks of the matrices $A$ and $B$ are $r$ and $s$,
respectively. Let $\Gamma_1' \Lambda_1 \Gamma_1$ denote the spectral decomposition of $A$. Denote the $r$
nonzero eigenvalues of $A$ by $\lambda_1, \ldots, \lambda_r$. Without loss of generality, assume that
these nonzero eigenvalues of $A$ are the first $r$ elements on the main diagonal of $\Lambda_1$
and let $\Gamma_{11}'$ be the $n \times r$ matrix whose columns are the corresponding eigenvectors.
Finally, let $\Lambda_{11} = \mathrm{diag}\{\lambda_1, \ldots, \lambda_r\}$. Then we can write the spectral decomposition
of $A$ in either of the two ways

$$ A = \Gamma_1' \Lambda_1 \Gamma_1 = \Gamma_{11}' \Lambda_{11} \Gamma_{11}. \tag{9.9.1} $$

Note that we can write $Q_1$ as

$$ Q_1 = \sigma^{-2} X' \Gamma_{11}' \Lambda_{11} \Gamma_{11} X = \sigma^{-2} (\Gamma_{11} X)' \Lambda_{11} (\Gamma_{11} X) = W_1' \Lambda_{11} W_1, \tag{9.9.2} $$

where $W_1 = \sigma^{-1} \Gamma_{11} X$. Next, obtain a similar representation based on the $s$ nonzero
eigenvalues $\gamma_1, \ldots, \gamma_s$ of $B$. Let $\Lambda_{22} = \mathrm{diag}\{\gamma_1, \ldots, \gamma_s\}$ denote the $s \times s$ diagonal
matrix of nonzero eigenvalues and form the $n \times s$ matrix $\Gamma_{21}' = [u_1 \cdots u_s]$ of
corresponding eigenvectors. Then we can write the spectral decomposition of $B$ as

$$ B = \Gamma_{21}' \Lambda_{22} \Gamma_{21}. \tag{9.9.3} $$

Also, we can write $Q_2$ as

$$ Q_2 = W_2' \Lambda_{22} W_2, \tag{9.9.4} $$


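where $W_2 = \sigma^{-1} \Gamma_{21} X$. Letting $W' = (W_1', W_2')$, we have

$$ W = \sigma^{-1} \begin{bmatrix} \Gamma_{11} \\ \Gamma_{21} \end{bmatrix} X. $$

Because $X$ has a $N_n(0, \sigma^2 I)$ distribution, Theorem 3.5.2 shows that $W$ has an
$(r + s)$-dimensional multivariate normal distribution with mean $0$ and
variance-covariance matrix

$$ \mathrm{Var}(W) = \begin{bmatrix} I_r & \Gamma_{11} \Gamma_{21}' \\ \Gamma_{21} \Gamma_{11}' & I_s \end{bmatrix}. \tag{9.9.5} $$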
Finally, using (9.9.1) and (9.9.3), we have the identity

$$ AB = \{\Gamma_{11}' \Lambda_{11}\} \, \Gamma_{11} \Gamma_{21}' \, \{\Lambda_{22} \Gamma_{21}\}. \tag{9.9.6} $$


Let $U$ denote the matrix in the first set of braces. Note that $U$ has full column
rank, so its kernel is null; i.e., its kernel consists of the vector $0$. Let $V$ denote the
matrix in the second set of braces. Note that $V$ has full row rank, hence the kernel
of $V'$ is null.

For the proof then, suppose $AB = 0$. Then

$$ U \left[ \Gamma_{11} \Gamma_{21}' V \right] = 0. $$

Because the kernel of $U$ is null, this implies each column of the matrix in the brackets
is the vector $0$; i.e., the matrix in the brackets is the matrix $0$. This implies that

$$ V' \left[ \Gamma_{21} \Gamma_{11}' \right] = 0. $$

In the same way, because the kernel of $V'$ is null, we have $\Gamma_{11} \Gamma_{21}' = 0$. Hence, by
(9.9.5), the random vectors $W_1$ and $W_2$ are independent. Therefore, by (9.9.2) and
(9.9.4), $Q_1$ and $Q_2$ are independent.

Conversely, if $Q_1$ and $Q_2$ are independent, then

$$ \{E[\exp\{t_1 Q_1 + t_2 Q_2\}]\}^{-2} = \{E[\exp\{t_1 Q_1\}]\}^{-2} \{E[\exp\{t_2 Q_2\}]\}^{-2}, \tag{9.9.7} $$

for $(t_1, t_2)$ in an open neighborhood of $(0, 0)$. Note that $t_1 Q_1 + t_2 Q_2$ is a quadratic
form in $X$ with symmetric matrix $t_1 A + t_2 B$. Recall that the matrix $\Gamma_1$ is orthogonal
and hence has determinant $\pm 1$. Using this and Theorem 9.8.2, we can write the left
side of (9.9.7) as

$$ \begin{aligned}
\{E[\exp\{t_1 Q_1 + t_2 Q_2\}]\}^{-2} &= |I_n - 2 t_1 A - 2 t_2 B| \\
&= |\Gamma_1' \Gamma_1 - 2 t_1 \Gamma_1' \Lambda_1 \Gamma_1 - 2 t_2 \Gamma_1' (\Gamma_1 B \Gamma_1') \Gamma_1| \\
&= |I_n - 2 t_1 \Lambda_1 - 2 t_2 D|,
\end{aligned} \tag{9.9.8} $$

where the matrix $D$ is given by

$$ D = \Gamma_1 B \Gamma_1' = \begin{bmatrix} D_{11} & D_{12} \\ D_{21} & D_{22} \end{bmatrix}, \tag{9.9.9} $$

and $D_{11}$ is $r \times r$. By (9.9.2), (9.9.3), and Theorem 9.8.2, the right side of (9.9.7)
can be written as

$$ \{E[\exp\{t_1 Q_1\}]\}^{-2} \{E[\exp\{t_2 Q_2\}]\}^{-2} = \left\{ \prod_{i=1}^{r} (1 - 2 t_1 \lambda_i) \right\} |I_n - 2 t_2 D|. \tag{9.9.10} $$

This leads to the identity

$$ |I_n - 2 t_1 \Lambda_1 - 2 t_2 D| = \left\{ \prod_{i=1}^{r} (1 - 2 t_1 \lambda_i) \right\} |I_n - 2 t_2 D|, \tag{9.9.11} $$

for $(t_1, t_2)$ in an open neighborhood of $(0, 0)$.

The coefficient of $(-2t_1)^r$ on the right side of (9.9.11) is $\lambda_1 \cdots \lambda_r |I_n - 2 t_2 D|$. It is
not so easy to find the coefficient of $(-2t_1)^r$ in the left side of equation (9.9.11).
Conceive of expanding this determinant in terms of minors of order $r$ formed from
the first $r$ columns. One term in this expansion is the product of the minor of order
$r$ in the upper left-hand corner, namely, $|I_r - 2 t_1 \Lambda_{11} - 2 t_2 D_{11}|$, and the minor of
order $n - r$ in the lower right-hand corner, namely, $|I_{n-r} - 2 t_2 D_{22}|$. Moreover, this
product is the only term in the expansion of the determinant that involves $(-2t_1)^r$.
Thus the coefficient of $(-2t_1)^r$ in the left-hand member of equation (9.9.11) is
$\lambda_1 \cdots \lambda_r |I_{n-r} - 2 t_2 D_{22}|$. If we equate these coefficients of $(-2t_1)^r$, we have

$$ |I_n - 2 t_2 D| = |I_{n-r} - 2 t_2 D_{22}|, \tag{9.9.12} $$

for $t_2$ in an open neighborhood of $0$. Equation (9.9.12) implies that the nonzero
eigenvalues of the matrices $D$ and $D_{22}$ are the same (see Exercise 9.9.8). Recall
that the sum of the squares of the eigenvalues of a symmetric matrix is equal to the
sum of the squares of the elements of that matrix (see Exercise 9.8.8). Thus the
sum of the squares of the elements of matrix $D$ is equal to the sum of the squares
of the elements of $D_{22}$. Since the elements of the matrix $D$ are real, it follows that
each of the elements of $D_{11}$, $D_{12}$, and $D_{21}$ is zero. Hence we can write

$$ 0 = \Lambda_1 D = \Gamma_1 A \Gamma_1' \Gamma_1 B \Gamma_1'; $$

because $\Gamma_1$ is an orthogonal matrix, $AB = 0$. ■
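The condition $AB = 0$ of Theorem 9.9.1 can be checked numerically. The following R sketch is only an illustration under assumed choices that do not come from the text: it takes $A = n^{-1} 1 1'$ (the matrix of $n \bar{X}^2$) and $B = I - A$ (the matrix of $\sum (X_i - \bar{X})^2$), confirms that $AB = 0$ up to rounding, and then simulates the two forms to see that their sample correlation is near zero, as independence would suggest.

```r
# Hypothetical illustration of Theorem 9.9.1; the matrices, seed, and sizes are assumptions.
n <- 5
A <- matrix(1, n, n) / n          # matrix of the quadratic form n * xbar^2
B <- diag(n) - A                  # matrix of sum((x - xbar)^2)
max_abs_AB <- max(abs(A %*% B))   # zero up to rounding error, so AB = 0

set.seed(1)
X <- matrix(rnorm(10000 * n), ncol = n)   # 10000 samples of size n from N(0, 1)
Q1 <- rowSums((X %*% A) * X)              # x'Ax for each row x
Q2 <- rowSums((X %*% B) * X)              # x'Bx for each row x
sample_cor <- cor(Q1, Q2)                 # expected to be close to 0
```

A near-zero correlation does not prove independence; it is only consistent with what the theorem guarantees when $AB = 0$.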


Remark 9.9.2. Theorem 9.9.1 remains valid if the random sample is from a
distribution that is $N(\mu, \sigma^2)$, whatever the real value of $\mu$. Moreover, Theorem 9.9.1 may
be extended to quadratic forms in random variables that have a joint multivariate
normal distribution with a positive definite covariance matrix $\Sigma$. The necessary and
sufficient condition for the independence of two such quadratic forms with
symmetric matrices $A$ and $B$ then becomes $A \Sigma B = 0$. In our Theorem 9.9.1, we have
$\Sigma = \sigma^2 I$, so that $A \Sigma B = A \sigma^2 I B = \sigma^2 AB = 0$. ■


The following theorem is from Hogg and Craig (1958). 


Theorem 9.9.2 (Hogg and Craig). Define the sum $Q = Q_1 + \cdots + Q_{k-1} + Q_k$,
where $Q, Q_1, \ldots, Q_{k-1}, Q_k$ are $k + 1$ random variables that are quadratic forms in the
observations of a random sample of size $n$ from a distribution that is $N(0, \sigma^2)$. Let
$Q/\sigma^2$ be $\chi^2(r)$, let $Q_i/\sigma^2$ be $\chi^2(r_i)$, $i = 1, 2, \ldots, k - 1$, and let $Q_k$ be nonnegative.
Then the random variables $Q_1, Q_2, \ldots, Q_k$ are independent and, hence, $Q_k/\sigma^2$ is
$\chi^2(r_k = r - r_1 - \cdots - r_{k-1})$.


Proof: Take first the case of $k = 2$ and let the real symmetric matrices of $Q$, $Q_1$, and
$Q_2$ be denoted, respectively, by $A$, $A_1$, $A_2$. We are given that $Q = Q_1 + Q_2$ or,
equivalently, that $A = A_1 + A_2$. We are also given that $Q/\sigma^2$ is $\chi^2(r)$ and that
$Q_1/\sigma^2$ is $\chi^2(r_1)$. In accordance with Theorem 9.8.4, we have $A^2 = A$ and $A_1^2 = A_1$.
Since $Q_2 \ge 0$, each of the matrices $A$, $A_1$, and $A_2$ is positive semidefinite. Because
$A^2 = A$, we can find an orthogonal matrix $\Gamma$ such that

$$ \Gamma' A \Gamma = \begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix}. $$

If we multiply both members of $A = A_1 + A_2$ on the left by $\Gamma'$ and on the right by
$\Gamma$, we have

$$ \begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix} = \Gamma' A_1 \Gamma + \Gamma' A_2 \Gamma. $$

Now each of $A_1$ and $A_2$, and hence each of $\Gamma' A_1 \Gamma$ and $\Gamma' A_2 \Gamma$, is positive
semidefinite. Recall that if a real symmetric matrix is positive semidefinite, each element on
the principal diagonal is positive or zero. Moreover, if an element on the principal
diagonal is zero, then all elements in that row and all elements in that column are
zero. Thus $\Gamma' A \Gamma = \Gamma' A_1 \Gamma + \Gamma' A_2 \Gamma$ can be written as

$$ \begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} G_r & 0 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} H_r & 0 \\ 0 & 0 \end{bmatrix}. \tag{9.9.13} $$

Since $A_1^2 = A_1$, we have

$$ (\Gamma' A_1 \Gamma)^2 = \Gamma' A_1 \Gamma = \begin{bmatrix} G_r & 0 \\ 0 & 0 \end{bmatrix}. $$

If we multiply both members of equation (9.9.13) on the left by the matrix $\Gamma' A_1 \Gamma$,
we see that

$$ \begin{bmatrix} G_r & 0 \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} G_r & 0 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} G_r H_r & 0 \\ 0 & 0 \end{bmatrix} $$

or, equivalently, $\Gamma' A_1 \Gamma = \Gamma' A_1 \Gamma + (\Gamma' A_1 \Gamma)(\Gamma' A_2 \Gamma)$. Thus $(\Gamma' A_1 \Gamma)(\Gamma' A_2 \Gamma) = 0$
and $A_1 A_2 = 0$. In accordance with Theorem 9.9.1, $Q_1$ and $Q_2$ are independent.
This independence immediately implies that $Q_2/\sigma^2$ is $\chi^2(r_2 = r - r_1)$. This
completes the proof when $k = 2$. For $k > 2$, the proof may be made by induction. We
shall merely indicate how this can be done by using $k = 3$. Take $A = A_1 + A_2 + A_3$,
where $A^2 = A$, $A_1^2 = A_1$, $A_2^2 = A_2$, and $A_3$ is positive semidefinite. Write
$A = A_1 + (A_2 + A_3) = A_1 + B_1$, say. Now $A^2 = A$, $A_1^2 = A_1$, and $B_1$ is
positive semidefinite. In accordance with the case of $k = 2$, we have $A_1 B_1 = 0$,
so that $B_1^2 = B_1$. With $B_1 = A_2 + A_3$, where $B_1^2 = B_1$, $A_2^2 = A_2$, it follows
from the case of $k = 2$ that $A_2 A_3 = 0$ and $A_3^2 = A_3$. If we regroup by writing
$A = A_2 + (A_1 + A_3)$, we obtain $A_1 A_3 = 0$, and so on. ■


Remark 9.9.3. In our statement of Theorem 9.9.2, we took $X_1, X_2, \ldots, X_n$ to
be observations of a random sample from a distribution that is $N(0, \sigma^2)$. We did
this because our proof of Theorem 9.9.1 was restricted to that case. In fact, if
$Q', Q_1', \ldots, Q_k'$ are quadratic forms in any normal variables (including multivariate
normal variables), if $Q' = Q_1' + \cdots + Q_k'$, if $Q', Q_1', \ldots, Q_{k-1}'$ are central or noncentral
chi-square, and if $Q_k'$ is nonnegative, then $Q_1', \ldots, Q_k'$ are independent and $Q_k'$ is
either central or noncentral chi-square. ■
This section concludes with a proof of a frequently quoted theorem due to
Cochran.


Theorem 9.9.3 (Cochran). Let $X_1, X_2, \ldots, X_n$ denote a random sample from a
distribution that is $N(0, \sigma^2)$. Let the sum of the squares of these observations be
written in the form

$$ \sum_{i=1}^{n} X_i^2 = Q_1 + Q_2 + \cdots + Q_k, $$

where $Q_j$ is a quadratic form in $X_1, X_2, \ldots, X_n$, with matrix $A_j$ that has rank
$r_j$, $j = 1, 2, \ldots, k$. The random variables $Q_1, Q_2, \ldots, Q_k$ are independent and
$Q_j/\sigma^2$ is $\chi^2(r_j)$, $j = 1, 2, \ldots, k$, if and only if $\sum_{j=1}^{k} r_j = n$.


Proof. First assume the two conditions $\sum_{j=1}^{k} r_j = n$ and $\sum_{i=1}^{n} X_i^2 = \sum_{j=1}^{k} Q_j$ to be
satisfied. The latter equation implies that $I = A_1 + A_2 + \cdots + A_k$. Let $B_i = I - A_i$;
that is, $B_i$ is the sum of the matrices $A_1, \ldots, A_k$ exclusive of $A_i$. Let $R_i$ denote
the rank of $B_i$. Since the rank of the sum of several matrices is less than or equal
to the sum of the ranks, we have $R_i \le \sum_{j=1}^{k} r_j - r_i = n - r_i$. However, $I = A_i + B_i$,
so that $n \le r_i + R_i$ and $n - r_i \le R_i$. Hence $R_i = n - r_i$. The eigenvalues of $B_i$ are
the roots of the equation $|B_i - \lambda I| = 0$. Since $B_i = I - A_i$, this equation can be
written as $|I - A_i - \lambda I| = 0$. Thus we have $|A_i - (1 - \lambda) I| = 0$. But each root of
the last equation is 1 minus an eigenvalue of $A_i$. Since $B_i$ has exactly $n - R_i = r_i$
eigenvalues that are zero, $A_i$ has exactly $r_i$ eigenvalues that are equal to 1.
However, $r_i$ is the rank of $A_i$. Thus each of the $r_i$ nonzero eigenvalues of $A_i$ is 1.
That is, $A_i^2 = A_i$ and thus $Q_i/\sigma^2$ has a $\chi^2(r_i)$ distribution, for $i = 1, 2, \ldots, k$. In accordance
with Theorem 9.9.2, the random variables $Q_1, Q_2, \ldots, Q_k$ are independent.

To complete the proof of Theorem 9.9.3, take

$$ \sum_{i=1}^{n} X_i^2 = Q_1 + Q_2 + \cdots + Q_k, $$

let $Q_1, Q_2, \ldots, Q_k$ be independent, and let $Q_j/\sigma^2$ be $\chi^2(r_j)$, $j = 1, 2, \ldots, k$. Then
$\sum_{j=1}^{k} Q_j/\sigma^2$ is $\chi^2\left(\sum_{j=1}^{k} r_j\right)$. But $\sum_{j=1}^{k} Q_j/\sigma^2 = \sum_{i=1}^{n} X_i^2/\sigma^2$ is $\chi^2(n)$. Thus $\sum_{j=1}^{k} r_j = n$
and the proof is complete. ■
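As a small numerical companion to Cochran's theorem, the R sketch below uses the familiar decomposition $\sum X_i^2 = n \bar{X}^2 + \sum (X_i - \bar{X})^2$. The choice of $n$, the test vector, and the tolerance used to count nonzero eigenvalues are assumptions made for illustration. It checks that the two matrices sum to $I$ and that their ranks, 1 and $n - 1$, add to $n$.

```r
# Hypothetical check of the rank condition in Theorem 9.9.3 for k = 2.
n <- 6
A1 <- matrix(1, n, n) / n        # matrix of Q1 = n * xbar^2
A2 <- diag(n) - A1               # matrix of Q2 = sum((x - xbar)^2)

# Rank of a symmetric matrix = number of nonzero eigenvalues (tolerance is an assumption).
sym_rank <- function(M, tol = 1e-8) sum(abs(eigen(M, symmetric = TRUE)$values) > tol)
r1 <- sym_rank(A1)
r2 <- sym_rank(A2)

# The decomposition of the sum of squares for one arbitrary vector x.
x <- c(1.2, -0.7, 3.1, 0.4, -2.2, 0.9)
Q1 <- drop(t(x) %*% A1 %*% x)
Q2 <- drop(t(x) %*% A2 %*% x)
```

Here `r1 + r2` equals `n`, which is the rank condition under which the theorem yields independent chi-square forms.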


EXERCISES 


9.9.1. Let $X_1, X_2, X_3$ be a random sample from the normal distribution $N(0, \sigma^2)$.
Are the quadratic forms $X_1^2 + 3 X_1 X_2 + X_2^2 + X_1 X_3 + X_3^2$ and
$X_1^2 - 2 X_1 X_2 + \frac{2}{3} X_2^2 - 2 X_1 X_3 - X_3^2$ independent or dependent?


9.9.2. Let $X_1, X_2, \ldots, X_n$ denote a random sample of size $n$ from a distribution
that is $N(0, \sigma^2)$. Prove that $\sum_{i=1}^{n} X_i^2$ and every quadratic form that is not identically
zero in $X_1, X_2, \ldots, X_n$ are dependent.


9.9.3. Let $X_1, X_2, X_3, X_4$ denote a random sample of size 4 from a distribution
that is $N(0, \sigma^2)$. Let $Y = \sum_{i=1}^{4} a_i X_i$, where $a_1, a_2, a_3$, and $a_4$ are real constants. If
$Y^2$ and $Q = X_1 X_2 - X_3 X_4$ are independent, determine $a_1, a_2, a_3$, and $a_4$.
9.9.4. Let $A$ be the real symmetric matrix of a quadratic form $Q$ in the observations
of a random sample of size $n$ from a distribution that is $N(0, \sigma^2)$. Given that $Q$
and the mean $\bar{X}$ of the sample are independent, what can be said of the elements
of each row (column) of $A$?

Hint: Are $Q$ and $\bar{X}^2$ independent?


9.9.5. Let $A_1, A_2, \ldots, A_k$ be the matrices of $k > 2$ quadratic forms $Q_1, Q_2, \ldots, Q_k$
in the observations of a random sample of size $n$ from a distribution that is $N(0, \sigma^2)$.
Prove that the pairwise independence of these forms implies that they are mutually
independent.

Hint: Show that $A_i A_j = 0$, $i \ne j$, permits $E[\exp(t_1 Q_1 + t_2 Q_2 + \cdots + t_k Q_k)]$ to
be written as a product of the mgfs of $Q_1, Q_2, \ldots, Q_k$.


9.9.6. Let $X' = [X_1, X_2, \ldots, X_n]$, where $X_1, X_2, \ldots, X_n$ are observations of a
random sample from a distribution that is $N(0, \sigma^2)$. Let $b' = [b_1, b_2, \ldots, b_n]$ be a
real nonzero vector, and let $A$ be a real symmetric matrix of order $n$. Prove that
the linear form $b'X$ and the quadratic form $X'AX$ are independent if and only if
$b'A = 0$. Use this fact to prove that $b'X$ and $X'AX$ are independent if and only
if the two quadratic forms $(b'X)^2 = X'bb'X$ and $X'AX$ are independent.


9.9.7. Let $Q_1$ and $Q_2$ be two nonnegative quadratic forms in the observations of a
random sample from a distribution that is $N(0, \sigma^2)$. Show that another quadratic
form $Q$ is independent of $Q_1 + Q_2$ if and only if $Q$ is independent of each of $Q_1$ and
$Q_2$.

Hint: Consider the orthogonal transformation that diagonalizes the matrix of
$Q_1 + Q_2$. After this transformation, what are the forms of the matrices of $Q$, $Q_1$, and
$Q_2$ if $Q$ and $Q_1 + Q_2$ are independent?


9.9.8. Prove that equation (9.9.12) of this section implies that the nonzero
eigenvalues of the matrices $D$ and $D_{22}$ are the same.

Hint: Let $\lambda = 1/(2t_2)$, $t_2 \ne 0$, and show that equation (9.9.12) is equivalent to
$|D - \lambda I_n| = (-\lambda)^r |D_{22} - \lambda I_{n-r}|$.

9.9.9. Here $Q_1$ and $Q_2$ are quadratic forms in observations of a random sample from
$N(0, 1)$. If $Q_1$ and $Q_2$ are independent and if $Q_1 + Q_2$ has a chi-square distribution,
prove that $Q_1$ and $Q_2$ are chi-square variables.


9.9.10. Often in regression the mean of the random variable $Y$ is a linear function
of $p$ values $x_1, x_2, \ldots, x_p$, say $\beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$, where $\beta' = (\beta_1, \beta_2, \ldots, \beta_p)$
are the regression coefficients. Suppose that $n$ values, $Y' = (Y_1, Y_2, \ldots, Y_n)$, are
observed for the $x$-values in $X = [x_{ij}]$, where $X$ is an $n \times p$ design matrix and its
$i$th row is associated with $Y_i$, $i = 1, 2, \ldots, n$. Assume that $Y$ is multivariate normal
with mean $X\beta$ and variance-covariance matrix $\sigma^2 I$, where $I$ is the $n \times n$ identity
matrix.

(a) Note that $Y_1, Y_2, \ldots, Y_n$ are independent. Why?

(b) Since $Y$ should approximately equal its mean $X\beta$, we estimate $\beta$ by solving
the normal equations $X'Y = X'X\beta$ for $\beta$. Assuming that $X'X$ is
nonsingular, solve the equations to get $\hat{\beta} = (X'X)^{-1} X'Y$. Show that $\hat{\beta}$ has a
multivariate normal distribution with mean $\beta$ and variance-covariance matrix
$\sigma^2 (X'X)^{-1}$.

(c) Show that

$$ (Y - X\beta)'(Y - X\beta) = (\hat{\beta} - \beta)'(X'X)(\hat{\beta} - \beta) + (Y - X\hat{\beta})'(Y - X\hat{\beta}). $$

For the remainder of the exercise, let $Q$ denote the quadratic form on the left
side of this expression and $Q_1$ and $Q_2$ denote the respective quadratic forms
on the right side. Hence, $Q = Q_1 + Q_2$.

(d) Show that $Q_1/\sigma^2$ is $\chi^2(p)$.

(e) Show that $Q_1$ and $Q_2$ are independent.

(f) Argue that $Q_2/\sigma^2$ is $\chi^2(n - p)$.

(g) Find $c$ so that $c Q_1/Q_2$ has an $F$-distribution.

(h) The fact that a value $d$ can be found so that $P(c Q_1/Q_2 \le d) = 1 - \alpha$ could
be used to find a $100(1 - \alpha)\%$ confidence ellipsoid for $\beta$. Explain.


9.9.11. Say that G.P.A. ($Y$) is thought to be a linear function of a "coded" high
school rank ($x_2$) and a "coded" American College Testing score ($x_3$), namely,
$\beta_1 + \beta_2 x_2 + \beta_3 x_3$. Note that all $x_1$ values equal 1. We observe the following five points:


x1 x2 x3 Y
1 1 2 3 
1 4 3 6 
1 2 2 =#A 
1 4 2 =A 
1 3 2 =A 


(a) Compute $X'X$ and $\hat{\beta} = (X'X)^{-1} X'Y$.
(b) Compute a 95% confidence ellipsoid for $\beta' = (\beta_1, \beta_2, \beta_3)$. See part (h) of
Exercise 9.9.10.


9.9.12. Assume that $X$ is an $n \times p$ matrix. Then the kernel of $X$ is defined to be
the space $\ker(X) = \{b : Xb = 0\}$.
(a) Show that $\ker(X)$ is a subspace of $R^p$.

(b) The dimension of $\ker(X)$ is called the nullity of $X$ and is denoted by $\nu(X)$.
Let $\rho(X)$ denote the rank of $X$. A fundamental theorem of linear algebra says
that $\rho(X) + \nu(X) = p$. Use this to show that if $X$ has full column rank, then
$\ker(X) = \{0\}$.


9.9.13. Suppose $X$ is an $n \times p$ matrix with rank $p$.
(a) Show that $\ker(X'X) = \ker(X)$.

(b) Use part (a) and the last exercise to show that if $X$ has full column rank, then
$X'X$ is nonsingular.


Chapter 10 


Nonparametric and Robust Statistics


10.1 Location Models 


In this chapter, we present some nonparametric procedures for the simple location 
problems. As we shall show, the test procedures associated with these methods 
are distribution-free under null hypotheses. We also obtain point estimators and 
confidence intervals associated with these tests. The distributions of the estimators 
are not distribution-free; hence, we use the term rank-based to refer collectively to 
these procedures. The asymptotic relative efficiencies of these procedures are easily 
obtained, thus facilitating comparisons among them and procedures that we have 
discussed in earlier chapters. We also obtain estimators that are asymptotically 
efficient; that is, they achieve asymptotically the Rao-Cramér bound.

Our purpose is not a rigorous development of these concepts, and at times we 
simply sketch the theory. A rigorous treatment can be found in several advanced 
texts, such as Randles and Wolfe (1979) or Hettmansperger and McKean (2011). 
For an applied discussion using R, see Kloke and McKean (2014). 

In this and the following section, we consider the one-sample problem. For the
most part, we consider continuous random variables $X$ with cdf and pdf $F_X(x)$
and $f_X(x)$, respectively. We assume that $f_X(x) > 0$ on the support of $X$; so, in
particular, $F_X(x)$ is strictly increasing on the support. In this and the succeeding
chapters, we want to identify classes of parameters. Think of a parameter as a
function of the cdf (or pdf) of a given random variable. For example, consider the
mean $\mu$ of $X$. We can write it as $\mu_X = T(F_X)$ if $T$ is defined as

$$ T(F_X) = E(X). $$

As another example, recall that the median of a random variable $X$ is a parameter
$\xi$ such that $F_X(\xi) = 1/2$; i.e., $\xi = F_X^{-1}(1/2)$. Hence, in this notation, we say that
the parameter $\xi$ is defined by the function $T(F_X) = F_X^{-1}(1/2)$. Note that these $T$s
are functions of the cdfs (or pdfs). We shall call them functionals.




Remark 10.1.1 (Natural Nonparametric Estimators). Functionals induce
nonparametric estimators naturally. Let $X_1, X_2, \ldots, X_n$ denote a random sample from
some distribution with cdf $F(x)$ and let $T(F)$ be a functional. Let $x_1, x_2, \ldots, x_n$ be
a realization of this sample. Recall that the empirical distribution function of the
sample is given by

$$ \hat{F}_n(x) = n^{-1} [\#\{x_i \le x\}], \quad -\infty < x < \infty. \tag{10.1.1} $$

Hence, $\hat{F}_n$ is a discrete cdf that puts mass (probability) $1/n$ at each $x_i$. Because
$\hat{F}_n(x)$ is a cdf, $T(\hat{F}_n)$ is well defined. Furthermore, $T(\hat{F}_n)$ depends only on the
sample; hence, it is a statistic. We call $T(\hat{F}_n)$ the induced estimator of $T(F)$.
For example, if $T(F)$ is the mean of the distribution, then it is easy to see that
$T(\hat{F}_n) = \bar{x}$; see Exercise 10.1.3.

For another example, consider the median. Note that $\hat{F}_n$ is a discrete cdf; hence,
we use the general definition of a median of a distribution that is given in Definition
1.7.2 of Chapter 1. Let $\hat{\theta}$ denote the usual sample median which is defined in
expression (4.4.4); that is, $\hat{\theta} = x_{((n+1)/2)}$ if $n$ is odd while $\hat{\theta} = [x_{(n/2)} + x_{((n/2)+1)}]/2$
if $n$ is even. To show that $\hat{\theta}$ satisfies Definition 1.7.2, note that:
• If $n$ is even, then $\hat{F}_n(\hat{\theta}) = 1/2$.
• If $n$ is odd, then

$$ n^{-1} \#\{x_i < \hat{\theta}\} = \frac{1}{2} - \frac{1}{2n} < 1/2 \quad \text{and} \quad \hat{F}_n(\hat{\theta}) \ge 1/2. $$

Thus in either case, by Definition 1.7.2, $\hat{\theta}$ is a median of $\hat{F}_n$. Note that when $n$
is even any point in the interval $(x_{(n/2)}, x_{((n/2)+1)})$ satisfies the definition of a
median. ■
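The induced estimators of Remark 10.1.1 can be computed directly from the empirical cdf. The R sketch below is only an illustration on a made-up sample of odd size with no ties (both assumptions). It builds $\hat{F}_n$ with `ecdf`, recovers the mass $1/n$ at each point, and evaluates the mean functional and the median functional on $\hat{F}_n$.

```r
# Hypothetical sample; odd n and distinct values are assumed for simplicity.
x <- c(2.1, 3.4, 1.7, 5.0, 4.2)
Fn <- ecdf(x)                       # empirical cdf as an R function
support <- knots(Fn)                # sorted distinct sample points
mass <- diff(c(0, Fn(support)))     # jump of Fn at each point, here 1/n each

mean_Fn <- sum(support * mass)      # mean functional applied to Fn

# Median functional: smallest point at which Fn reaches 1/2.
med_Fn <- min(support[Fn(support) >= 0.5])
```

For this sample `mean_Fn` agrees with `mean(x)` and `med_Fn` agrees with `median(x)`, as the remark describes.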


We begin with the definition of a location functional. 


Definition 10.1.1. Let $X$ be a continuous random variable with cdf $F_X(x)$ and pdf
$f_X(x)$. We say that $T(F_X)$ is a location functional if it satisfies

$$ \text{If } Y = X + a, \text{ then } T(F_Y) = T(F_X) + a, \text{ for all } a \in R, \tag{10.1.2} $$
$$ \text{If } Y = aX, \text{ then } T(F_Y) = a T(F_X), \text{ for all } a \ne 0. \tag{10.1.3} $$

For example, suppose $T$ is the mean functional; i.e., $T(F_X) = E(X)$. Let
$Y = X + a$; then $E(Y) = E(X + a) = E(X) + a$. Secondly, if $Y = aX$, then
$E(Y) = aE(X)$. Hence the mean is a location functional. The next example shows
that the median is a location functional.


Example 10.1.1. Let $F_X(x)$ be the cdf of $X$ and let $T(F_X) = F_X^{-1}(1/2)$ be the
median functional of $X$. Note that another way to state this is $F_X(T(F_X)) = 1/2$.
Let $Y = X + a$. It then follows that the cdf of $Y$ is $F_Y(y) = F_X(y - a)$. The
following identity shows that $T(F_Y) = T(F_X) + a$:

$$ F_Y(T(F_X) + a) = F_X(T(F_X) + a - a) = F_X(T(F_X)) = 1/2. $$

Next, suppose $Y = aX$. If $a > 0$, then $F_Y(y) = F_X(y/a)$ and, hence,

$$ F_Y(a T(F_X)) = F_X(a T(F_X)/a) = F_X(T(F_X)) = 1/2. $$

Thus $T(F_Y) = a T(F_X)$ when $a > 0$. On the other hand, if $a < 0$, then $F_Y(y) =
1 - F_X(y/a)$. Hence

$$ F_Y(a T(F_X)) = 1 - F_X(a T(F_X)/a) = 1 - F_X(T(F_X)) = 1 - \frac{1}{2} = \frac{1}{2}. $$

Therefore, (10.1.3) holds for all $a \ne 0$. Thus the median is a location functional.

Recall that the median is a percentile, namely, the 50th percentile of a distribution.
As Exercise 10.1.1 shows, the median is the only percentile that is a location
functional. ■
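The two defining properties can be illustrated on sample quantiles, which are the corresponding functionals evaluated at an empirical cdf. The R sketch below rests on assumptions not taken from the text: a made-up sample of odd size, a negative multiplier $a$, and R's default `quantile` rule. The median commutes with both the shift and the rescaling, whereas the 75th percentile fails (10.1.3) when $a < 0$.

```r
# Hypothetical data; odd n so the sample median is a data point.
x <- c(2.3, -1.1, 0.4, 5.6, 3.0, -0.7, 1.9)
a <- -2.5

# Median: both location-functional properties hold.
shift_ok <- isTRUE(all.equal(median(x + a), median(x) + a))
scale_ok <- isTRUE(all.equal(median(a * x), a * median(x)))

# 75th percentile: the shift property holds, the scale property fails for a < 0.
q75 <- function(v) unname(quantile(v, 0.75))
q75_shift_ok <- isTRUE(all.equal(q75(x + a), q75(x) + a))
q75_scale_ok <- isTRUE(all.equal(q75(a * x), a * q75(x)))
```

Multiplying by a negative constant reverses the ordering, so the 75th percentile of $aX$ corresponds to the 25th percentile of $X$; only the 50th percentile is unaffected by this reversal.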


We often continue to use parameter notation to denote functionals. For example,
$\theta_X = T(F_X)$.

In Chapters 4 and 6, we wrote the location model for specified pdfs. In this
chapter, we write it for a general pdf in terms of a specified location functional.
Let $X$ be a random variable with cdf $F_X(x)$ and pdf $f_X(x)$. Let $\theta_X = T(F_X)$ be a
location functional. Define the random variable $\varepsilon$ to be $\varepsilon = X - T(F_X)$. Then by
(10.1.2), $T(F_\varepsilon) = 0$; i.e., $\varepsilon$ has location 0, according to $T$. Further, the pdf of $X$
can be written as $f_X(x) = f(x - T(F_X))$, where $f(x)$ is the pdf of $\varepsilon$.


Definition 10.1.2 (Location Model). Let $\theta_X = T(F_X)$ be a location functional. We
say that the observations $X_1, X_2, \ldots, X_n$ follow a location model with functional
$\theta_X = T(F_X)$ if

$$ X_i = \theta_X + \varepsilon_i, \tag{10.1.4} $$

where $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are iid random variables with pdf $f(x)$ and $T(F_\varepsilon) = 0$. Hence,
from the above discussion, $X_1, X_2, \ldots, X_n$ are iid with pdf $f_X(x) = f(x - T(F_X))$.


Example 10.1.2. Let $\varepsilon$ be a random variable with cdf $F(x)$, such that $F(0) = 1/2$.
Assume that $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are iid with cdf $F(x)$. Let $\theta \in R$ and define

$$ X_i = \theta + \varepsilon_i, \quad i = 1, 2, \ldots, n. $$

Then $X_1, X_2, \ldots, X_n$ follow the location model with the locational functional $\theta$,
which is the median of $X_i$. ■


Note that the location model very much depends on the functional. It forces one 
to state clearly which location functional is being used in order to write the model 
statement. For the class of symmetric densities, though, all location functionals are 
the same. 


Theorem 10.1.1. Let $X$ be a random variable with cdf $F_X(x)$ and pdf $f_X(x)$ such
that the distribution of $X$ is symmetric about $a$. Let $T(F_X)$ be any location
functional. Then $T(F_X) = a$.

Proof: By (10.1.2), we have

$$ T(F_{X-a}) = T(F_X) - a. \tag{10.1.5} $$

Since the distribution of $X$ is symmetric about $a$, it is easy to show that $X - a$ and
$-(X - a)$ have the same distribution; see Exercise 10.1.2. Hence, using (10.1.2) and
(10.1.3), we have

$$ T(F_{X-a}) = T(F_{-(X-a)}) = -(T(F_X) - a) = -T(F_X) + a. \tag{10.1.6} $$

Putting (10.1.5) and (10.1.6) together gives $T(F_X) - a = -T(F_X) + a$; that is,
$T(F_X) = a$, which is the result. ■

The assumption of symmetry is very appealing, because the concept of "center"
is unique when it is true.


EXERCISES 


10.1.1. Let $X$ be a continuous random variable with cdf $F(x)$. For $0 < p < 1$, let
$\xi_p$ be the $p$th quantile; i.e., $F(\xi_p) = p$. If $p \ne 1/2$, show that while property (10.1.2)
holds, property (10.1.3) does not. Thus $\xi_p$ is not a location parameter.


10.1.2. Let $X$ be a continuous random variable with pdf $f(x)$. Suppose $f(x)$ is
symmetric about $a$; i.e., $f(x - a) = f(-(x - a))$. Show that the random variables
$X - a$ and $-(X - a)$ have the same pdf.


10.1.3. Let $\hat{F}_n(x)$ denote the empirical cdf of the sample $X_1, X_2, \ldots, X_n$. The
distribution of $\hat{F}_n(x)$ puts mass $1/n$ at each sample item $X_i$. Show that its mean is
$\bar{X}$. If $T(F) = F^{-1}(1/2)$ is the median, show that $T(\hat{F}_n) = Q_2$, the sample median.


10.1.4. Let $X$ be a random variable with cdf $F(x)$ and let $T(F)$ be a functional.
We say that $T(F)$ is a scale functional if it satisfies the three properties

(i) $T(F_{aX}) = a T(F_X)$, for $a > 0$
(ii) $T(F_{X+b}) = T(F_X)$, for all $b$
(iii) $T(F_{-X}) = T(F_X)$.

Show that the following functionals are scale functionals.
(a) The standard deviation, $T(F_X) = (\mathrm{Var}(X))^{1/2}$.
(b) The interquartile range, $T(F_X) = F_X^{-1}(3/4) - F_X^{-1}(1/4)$.


10.2 Sample Median and the Sign Test 


In this section, we consider inference for the median of a distribution using the 
sample median. Fundamental to this discussion is the sign test statistic, which we 
present first. 

Let $X_1, X_2, \ldots, X_n$ be a random sample that follows the location model

$$ X_i = \theta + \varepsilon_i, \tag{10.2.1} $$

where $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are iid with cdf $F(x)$, pdf $f(x)$, and median 0. Note that in
terms of Section 10.1, the location functional is the median and, hence, $\theta$ is the
median of $X_i$. We begin with a test for the one-sided hypotheses

$$ H_0 : \theta = \theta_0 \text{ versus } H_1 : \theta > \theta_0. \tag{10.2.2} $$

Consider the statistic

$$ S(\theta_0) = \#\{X_i > \theta_0\}, \tag{10.2.3} $$

which is called the sign statistic because it counts the number of positive signs in
the differences $X_i - \theta_0$, $i = 1, 2, \ldots, n$. If we define $I(x > a)$ to be 1 or 0 depending
on whether $x > a$ or $x \le a$, then we can express $S(\theta_0)$ as

$$ S(\theta_0) = \sum_{i=1}^{n} I(X_i > \theta_0). \tag{10.2.4} $$

Note that if $H_0$ is true, then we expect one half of the observations to exceed $\theta_0$,
while if $H_1$ is true, we expect more than half of the observations to exceed $\theta_0$.
Consider then the test of the hypotheses (10.2.2) given by

$$ \text{Reject } H_0 \text{ in favor of } H_1 \text{ if } S(\theta_0) \ge c. \tag{10.2.5} $$

Under the null hypothesis, the random variables $I(X_i > \theta_0)$ are iid with a Bernoulli
$b(1, 1/2)$ distribution. Hence the null distribution of $S(\theta_0)$ is $b(n, 1/2)$ with mean
$n/2$ and variance $n/4$. Note that under $H_0$, the distribution of the sign test statistic
does not depend on the distribution of $X_i$. In general, we call such a test a
distribution free test.

For a level $\alpha$ test, select $c$ to be $c_\alpha$, where $c_\alpha$ is the upper $\alpha$ critical point of a
binomial $b(n, 1/2)$ distribution. The test statistic, though, has a discrete distribution,
so for an exact test there are only a finite number of levels $\alpha$ available. The values
of $c_\alpha$ are easily found by most computer packages. For instance, the R command
pbinom(0:15,15,.5) returns the cdf of a binomial distribution with $n = 15$ and
$p = 0.5$, from which all possible levels can be seen.

For a given data set, the p-value associated with the sign test is given by $\hat{p} =
P_{H_0}(S(\theta_0) \ge s)$, where $s$ is the realized value of $S(\theta_0)$ based on the sample. For
computation, the R command 1 - pbinom(s-1,n,.5) computes $\hat{p}$.
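The two R commands just mentioned can be combined into a short sketch. The helper name `sign_pvalue` and the choice $n = 15$ are assumptions made for illustration; only `pbinom` comes from the text. The vector `levels` lists the exact level $P_{H_0}(S(\theta_0) \ge c)$ attained by each possible critical value $c$.

```r
# Exact levels available for the one-sided sign test with n = 15 (illustrative choice).
n <- 15
crit <- 0:(n + 1)                          # candidate critical values c
levels <- 1 - pbinom(crit - 1, n, 0.5)     # P(S >= c) under H0 for each c
names(levels) <- crit

# Hypothetical helper: one-sided p-value for a realized sign statistic s.
sign_pvalue <- function(s, n) 1 - pbinom(s - 1, n, 0.5)
p12 <- sign_pvalue(12, n)                  # equals the level of the test with c = 12
```

For example, the entry of `levels` for $c = 12$ is $576/2^{15}$, so a test of exact level 0.05 is not available for $n = 15$.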

It is convenient at times to use a large sample test based on the asymptotic
distribution of the test statistic. By the Central Limit Theorem, under $H_0$ the
standardized statistic $[S(\theta_0) - (n/2)]/(\sqrt{n}/2)$ is asymptotically normal, $N(0, 1)$. Hence
the large sample test rejects $H_0$ if

$$ \frac{S(\theta_0) - (n/2)}{\sqrt{n}/2} \ge z_\alpha; \tag{10.2.6} $$

see Exercise 10.2.2.
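The large sample rule (10.2.6) takes only a few lines of R. The function name `sign_z_test`, its default $\alpha = 0.05$, and the numbers in the call are assumptions for illustration; `qnorm(1 - alpha)` supplies $z_\alpha$.

```r
# Hypothetical helper implementing the large-sample sign test (10.2.6).
sign_z_test <- function(s, n, alpha = 0.05) {
  z <- (s - n / 2) / (sqrt(n) / 2)          # standardized sign statistic
  list(z = z, reject = z >= qnorm(1 - alpha))
}

res <- sign_z_test(s = 11, n = 20)          # illustrative values of s and n
```

With these values the standardized statistic is $1/\sqrt{5} \approx 0.447$, well below $z_{0.05}$, so the one-sided test does not reject.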
We briefly touch on the two-sided hypotheses given by

$$ H_0 : \theta = \theta_0 \text{ versus } H_1 : \theta \ne \theta_0. \tag{10.2.7} $$

The following symmetric decision rule seems appropriate:

$$ \text{Reject } H_0 \text{ in favor of } H_1 \text{ if } S(\theta_0) \le c_1 \text{ or if } S(\theta_0) \ge n - c_1. \tag{10.2.8} $$

For a level $\alpha$ test, $c_1$ would be chosen such that $\alpha/2 = P_{H_0}(S(\theta_0) \le c_1)$. Recall
that the p-value is given by $\hat{p} = 2 \min\{P_{H_0}(S(\theta_0) \le s), P_{H_0}(S(\theta_0) \ge s)\}$, where $s$ is
the realized value of $S(\theta_0)$ based on the sample.


Example 10.2.1 (Shoshoni Rectangles). A golden rectangle is a rectangle in which
the ratio of the width ($w$) to the length ($l$) is the golden ratio, which is
approximately 0.618. It can be characterized in various ways. For example, $w/l = l/(w + l)$
characterizes the golden rectangle. It is considered to be an aesthetic standard in
Western civilization and appears in art and architecture going back to the ancient
Greeks. It now appears in such items as credit and business cards. In a cultural
anthropology study, DuBois (1960) reports on a study of the Shoshoni beaded
baskets. These baskets contain beaded rectangles, and the question was whether the
Shoshonis use the same aesthetic standard as the West. Let $X$ denote the ratio of
the width to the length of a Shoshoni beaded basket. Let $\theta$ be the median of $X$.
The hypotheses of interest are

$$ H_0 : \theta = 0.618 \text{ versus } H_1 : \theta \ne 0.618. $$

These are two-sided hypotheses. It follows from the above discussion that the sign
test rejects $H_0$ in favor of $H_1$ if $S(0.618) \le c$ or $S(0.618) \ge n - c$.

A sample of 20 width to length (ordered) ratios from Shoshoni baskets resulted 
in the data 


Width-to-Length Ratios of Rectangles 
0.553 0.570 0.576 0.601 0.606 0.606 0.609 0.611 0.615 0.628 


0.654 0.662 0.668 0.670 0.672 0.690 0.693 0.749 0.844 0.933 


The data can be found in the file shoshoni.rda. For these data, the sign test
statistic is $S(0.618) = 11$. Using R, the p-value is 2*(1-pbinom(10,20,.5)) = 0.8238.
Thus there is no evidence to reject $H_0$ based on these data.
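The computation in this example can be reproduced from the 20 ratios listed above. In the R sketch below the data are typed in directly rather than loaded from shoshoni.rda, and the variable names are assumptions; the p-value follows the two-sided formula $2 \min\{P_{H_0}(S \le s), P_{H_0}(S \ge s)\}$.

```r
# Width-to-length ratios from Example 10.2.1, entered by hand.
ratios <- c(0.553, 0.570, 0.576, 0.601, 0.606, 0.606, 0.609, 0.611, 0.615, 0.628,
            0.654, 0.662, 0.668, 0.670, 0.672, 0.690, 0.693, 0.749, 0.844, 0.933)
theta0 <- 0.618
n <- length(ratios)

s <- sum(ratios > theta0)                  # sign statistic S(0.618)
p_lower <- pbinom(s, n, 0.5)               # P(S <= s) under H0
p_upper <- 1 - pbinom(s - 1, n, 0.5)       # P(S >= s) under H0
p_value <- 2 * min(p_lower, p_upper)       # two-sided p-value
```

This gives `s` equal to 11 and `p_value` of about 0.8238, matching the values reported in the example.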

A boxplot and a normal q-q plot of the data are given in Figure 10.2.1. Notice
that the data contain two, possibly three, potential outliers. The data do not appear 
to be drawn from a normal distribution. 


We next obtain several useful results concerning the power function of the sign
test for the hypotheses (10.2.2). The following function proves useful here and in
the associated estimation and confidence intervals described below. Define

$$ S(\theta) = \#\{X_i > \theta\}. \tag{10.2.9} $$

The sign test statistic is given by $S(\theta_0)$. We can easily describe the function $S(\theta)$.
First, note that we can write it in terms of the order statistics $Y_1 < \cdots < Y_n$ of
$X_1, \ldots, X_n$ because $\#\{Y_i > \theta\} = \#\{X_i > \theta\}$. Now if $\theta < Y_1$, then all the $Y_i$s
are larger than $\theta$ and, hence, $S(\theta) = n$. Next, if $Y_1 \le \theta < Y_2$, then $S(\theta) = n - 1$.
Continuing this way, we see that $S(\theta)$ is a decreasing step function of $\theta$, which steps
down one unit at each order statistic $Y_i$, attaining its maximum and minimum values
$n$ and 0 at $Y_1$ and $Y_n$, respectively. Figure 10.2.2 depicts this function.

Figure 10.2.1: Boxplot (Panel A) and normal q-q plot (Panel B) of the Shoshoni
data.
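The step-function behavior of $S(\theta)$ is easy to see numerically. In the R sketch below, the function name `S_fun`, the sample, and the grid of $\theta$ values are all assumptions chosen for illustration.

```r
# Hypothetical helper: S(theta) = #{x_i > theta}, vectorized over theta.
S_fun <- function(theta, x) sapply(theta, function(t) sum(x > t))

x <- c(3, 1, 4, 1.5, 9)                    # made-up sample; order statistics 1, 1.5, 3, 4, 9
theta_grid <- c(0, 1, 2, 5, 9, 10)
S_vals <- S_fun(theta_grid, x)             # 5 4 3 1 0 0
```

The values start at $n = 5$ to the left of $Y_1$, drop by one at each order statistic that is passed, and reach 0 at $Y_n = 9$.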

We need the following translation property. Because we can always subtract $\theta_0$
from each $X_i$, we can assume without loss of generality that $\theta_0 = 0$.


Lemma 10.2.1. For every $k$,

$$ P_\theta[S(0) \ge k] = P_0[S(-\theta) \ge k]. \tag{10.2.10} $$

Proof: Note that the left side of equation (10.2.10) concerns the probability of the
event $\#\{X_i > 0\} \ge k$, where $X_i$ has median $\theta$. The right side concerns the probability
of the event $\#\{(X_i + \theta) > 0\} \ge k$, where the random variable $X_i + \theta$ has median $\theta$
(because under $\theta = 0$, $X_i$ has median 0). Hence the left and right sides give the
same probability. ■

Figure 10.2.2: The sketch shows the graph of the decreasing step function $S(\theta)$.
The function drops one unit at each order statistic $Y_i$.


Based on this lemma, it is easy to show that the power function of the sign test 
is monotone for one-sided tests. 


Theorem 10.2.1. Suppose Model (10.2.1) is true. Let $\gamma(\theta)$ be the power function
of the sign test of level $\alpha$ for the one-sided hypotheses (10.2.2). Then $\gamma(\theta)$ is a
nondecreasing function of $\theta$.


Proof: Let $c_\alpha$ denote the $b(n, 1/2)$ upper critical value as defined after expression
(10.2.8). Without loss of generality, assume that $\theta_0 = 0$. The power function of the
sign test is

$$ \gamma(\theta) = P_\theta[S(0) \ge c_\alpha], \quad \text{for } -\infty < \theta < \infty. $$

Suppose $\theta_1 < \theta_2$. Then $-\theta_1 > -\theta_2$ and hence, since $S(\theta)$ is nonincreasing, $S(-\theta_1) \le
S(-\theta_2)$. This and Lemma 10.2.1 yield the desired result; i.e.,

$$ \begin{aligned}
\gamma(\theta_1) &= P_{\theta_1}[S(0) \ge c_\alpha] \\
&= P_0[S(-\theta_1) \ge c_\alpha] \\
&\le P_0[S(-\theta_2) \ge c_\alpha] \\
&= P_{\theta_2}[S(0) \ge c_\alpha] = \gamma(\theta_2). \quad ■
\end{aligned} $$

This is a very desirable property for any test. Because the monotonicity of the
power function of the sign test holds for all $\theta$, $-\infty < \theta < \infty$, we can extend the
simple null hypothesis of (10.2.2) to the composite null hypothesis

$$ H_0 : \theta \le \theta_0 \text{ versus } H_1 : \theta > \theta_0. \tag{10.2.11} $$

Recall from Definition 4.5.4 of Chapter 4 that the size of the test for a composite
null hypothesis is given by $\max_{\theta \le \theta_0} \gamma(\theta)$. Because $\gamma(\theta)$ is nondecreasing, the size of
the sign test is $\alpha$ for this extended null hypothesis. As a second result, it follows
immediately that the sign test is an unbiased test; see Section 8.3. As Exercise
10.2.8 shows, the power function of the sign test for the other one-sided alternative,
$H_1 : \theta < \theta_0$, is nonincreasing.


Under an alternative, say $\theta = \theta_1$, the test statistic $S(\theta_0)$ has the binomial
distribution $b(n, p_1)$, where $p_1$ is given by

$$ p_1 = P_{\theta_1}(X > 0) = 1 - F(-\theta_1), \tag{10.2.12} $$

where $F(x)$ is the cdf of $\varepsilon$ in Model (10.2.1). Hence $S(\theta_0)$ is not distribution free
under alternative hypotheses. As in Exercise 10.2.3, we can determine the power of
the test for specified $\theta_1$ and $F(x)$. We want to compare the power of the sign test
to other size $\alpha$ tests, in particular the test based on the sample mean. However,
for these comparison purposes, we need more general results, some of which are
obtained in the next subsection.
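For a specified error cdf the power follows directly from (10.2.12). The R sketch below is an illustration under assumptions that are not in the text: $\theta_0 = 0$, standard normal errors (so $F$ is `pnorm`), $n = 15$, and critical value $c = 12$. The function name `sign_power` is also hypothetical.

```r
# Hypothetical helper: exact power of the one-sided sign test with theta0 = 0.
# cdf is the assumed error cdf F, which must satisfy F(0) = 1/2.
sign_power <- function(theta1, n, c_crit, cdf = pnorm) {
  p1 <- 1 - cdf(-theta1)                   # success probability under theta1, as in (10.2.12)
  1 - pbinom(c_crit - 1, n, p1)            # P(S(0) >= c) when S(0) ~ b(n, p1)
}

theta_grid <- c(-0.5, 0, 0.5, 1, 1.5)
power_vals <- sign_power(theta_grid, n = 15, c_crit = 12)
```

At $\theta_1 = 0$ the power equals the size $576/2^{15}$, and the values increase along the grid, in line with Theorem 10.2.1.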


10.2.1 Asymptotic Relative Efficiency 


One solution to this problem is to consider the behavior of a test under a sequence
of local alternatives. In this section, we often take $\theta_0 = 0$ in hypotheses (10.2.2). As
noted before Lemma 10.2.1, this is without loss of generality. For the hypotheses
(10.2.2), consider the sequence of alternatives

$$ H_0 : \theta = 0 \text{ versus } H_{1n} : \theta_n = \frac{\delta}{\sqrt{n}}, \tag{10.2.13} $$

where $\delta > 0$. Note that this sequence of alternatives converges to the null hypothesis
as $n \to \infty$. We often call such a sequence of alternatives local alternatives. The
idea is to consider how the power function of a test behaves relative to the power
functions of other tests under this sequence of alternatives. We only sketch this
development. For more details, the reader can consult the more advanced books
cited in Section 10.1. As a first step in that direction, we obtain the asymptotic
power lemma for the sign test.


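Consider the large sample size $\alpha$ test given by (10.2.6). Under the alternative $\theta_n$, we can approximate the mean of this test as follows:

$$\begin{aligned}
E_{\theta_n}\left[\frac{1}{\sqrt{n}}\left(S(0) - \frac{n}{2}\right)\right] &= E_0\left[\frac{1}{\sqrt{n}}\left(S(-\theta_n) - \frac{n}{2}\right)\right] \\
&= \frac{1}{\sqrt{n}} \sum_{i=1}^n E_0[I(X_i > -\theta_n)] - \frac{\sqrt{n}}{2} \\
&= \frac{1}{\sqrt{n}} \sum_{i=1}^n P_0(X_i > -\theta_n) - \frac{\sqrt{n}}{2} \\
&= \sqrt{n}\left(1 - F(-\theta_n) - \frac{1}{2}\right) \\
&= \sqrt{n}\left(\frac{1}{2} - F(-\theta_n)\right) \\
&\approx \sqrt{n}\,\theta_n f(0) = \delta f(0),
\end{aligned} \tag{10.2.14}$$

where the step to the last line is due to the mean value theorem. It can be shown in more advanced texts that the variance of $[S(0) - (n/2)]/(\sqrt{n}/2)$ converges to 1 under $\theta_n$, just as under $H_0$, and that, furthermore, $[S(0) - (n/2) - \sqrt{n}\,\delta f(0)]/(\sqrt{n}/2)$ has a limiting standard normal distribution. This leads to the asymptotic power lemma, which we state in the form of a theorem.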


Theorem 10.2.2 (Asymptotic Power Lemma). Consider the sequence of hypotheses (10.2.13). The limit of the power function of the large sample, size $\alpha$, sign test is

$$\lim_{n \to \infty} \gamma(\theta_n) = 1 - \Phi(z_\alpha - \delta \tau_S^{-1}), \tag{10.2.15}$$

where $\tau_S = 1/[2f(0)]$ and $\Phi(z)$ is the cdf of a standard normal random variable.


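Proof: Using expression (10.2.14) and the discussion that followed its derivation, we have

$$\begin{aligned}
\gamma(\theta_n) &= P_{\theta_n}\left[\frac{n^{-1/2}[S(0) - (n/2)]}{1/2} \ge z_\alpha\right] \\
&= P_{\theta_n}\left[\frac{n^{-1/2}[S(0) - (n/2) - \sqrt{n}\,\delta f(0)]}{1/2} \ge z_\alpha - \delta\, 2f(0)\right] \\
&\to 1 - \Phi(z_\alpha - \delta\, 2f(0)),
\end{aligned}$$

which was to be shown. ■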


As shown in Exercise 10.2.5, the parameter $\tau_S = 1/[2f(0)]$ is a scale parameter (functional) as defined in Exercise 10.1.4 of the last section. We later show that $\tau_S/\sqrt{n}$ is the asymptotic standard deviation of the sample median.

Note that there were several approximations used in the proof of Theorem 10.2.2. A rigorous proof can be found in more advanced texts, such as those cited in Section 10.1. It is quite helpful for the next sections to reconsider the approximation of the mean given in (10.2.14) in terms of another concept called efficacy. Consider another standardization of the test statistic given by

$$\bar{S}(0) = \frac{1}{n} \sum_{i=1}^n I(X_i > 0), \tag{10.2.16}$$

where the bar notation is used to signify that $\bar{S}(0)$ is an average of $I(X_i > 0)$ and, in this case under $H_0$, converges in probability to $\frac{1}{2}$. Let $\mu(\theta) = E_\theta(\bar{S}(0) - \frac{1}{2})$. Then, by expression (10.2.14), we have

$$\mu(\theta_n) = E_{\theta_n}\left(\bar{S}(0) - \frac{1}{2}\right) = \frac{1}{2} - F(-\theta_n). \tag{10.2.17}$$

Let $\sigma_{\bar{S}}^2 = \mathrm{Var}(\bar{S}(0)) = \frac{1}{4n}$. Finally, define the efficacy of the sign test to be

$$c_S = \lim_{n \to \infty} \frac{\mu'(0)}{\sqrt{n}\,\sigma_{\bar{S}}}. \tag{10.2.18}$$

That is, the efficacy is the rate of change of the mean of the test statistic at the null divided by the product of $\sqrt{n}$ and the standard deviation of the test statistic at the null. So the efficacy increases with an increase in this rate, as it should. We use this formulation of efficacy throughout this chapter.

Hence, by expression (10.2.14), the efficacy of the sign test is

$$c_S = \frac{f(0)}{1/2} = 2f(0) = \tau_S^{-1}, \tag{10.2.19}$$

the reciprocal of the scale parameter $\tau_S$. In terms of efficacy, we can write the conclusion of the Asymptotic Power Lemma as

$$\lim_{n \to \infty} \gamma(\theta_n) = 1 - \Phi(z_\alpha - \delta c_S). \tag{10.2.20}$$

This is not a coincidence, and it is true for the procedures we consider in the next section.


Remark 10.2.1. In this chapter, we compare nonparametric procedures with traditional parametric procedures. For instance, we compare the sign test with the test based on the sample mean. Traditionally, tests based on sample means are referred to as t-tests. Even though our comparisons are asymptotic and we could use the terminology of z-tests, we instead use the traditional terminology of t-tests. ■


As a second illustration of efficacy, we determine the efficacy of the t-test for the mean. Assume that the random variables $\varepsilon_i$ in Model (10.2.1) are symmetrically distributed about 0 and their mean exists. Hence the parameter $\theta$ is the location parameter. In particular, $\theta = E(X_i) = \mathrm{med}(X_i)$. Denote the variance of $X_i$ by $\sigma^2$. This allows us to easily compare the sign and t-tests. Recall for hypotheses (10.2.2) that the t-test rejects $H_0$ in favor of $H_1$ if $\bar{X} > c$. The form of the test statistic is then $\bar{X}$. Furthermore, we have

$$\mu_{\bar{X}}(\theta) = E_\theta(\bar{X}) = \theta \tag{10.2.21}$$

and

$$\sigma_{\bar{X}}^2(0) = V_0(\bar{X}) = \frac{\sigma^2}{n}. \tag{10.2.22}$$

Thus, by (10.2.21) and (10.2.22), the efficacy of the t-test is

$$c_t = \lim_{n \to \infty} \frac{\mu_{\bar{X}}'(0)}{\sqrt{n}\,(\sigma/\sqrt{n})} = \frac{1}{\sigma}. \tag{10.2.23}$$

As confirmed in Exercise 10.2.9, the asymptotic power of the large sample level $\alpha$ t-test under the sequence of alternatives (10.2.13) is $1 - \Phi(z_\alpha - \delta c_t)$. Thus we can compare the sign and t-tests by comparing their efficacies. We do this from the perspective of sample size determination.

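Assume without loss of generality that $H_0: \theta = 0$. Now suppose we want to determine the sample size so that a level $\alpha$ sign test can detect the alternative $\theta^* > 0$ with (approximate) probability $\gamma^*$. That is, find $n$ so that

$$\gamma^* = \gamma(\theta^*) = P_{\theta^*}\left[\frac{S(0) - (n/2)}{\sqrt{n}/2} \ge z_\alpha\right]. \tag{10.2.24}$$

Write $\theta^* = \sqrt{n}\theta^*/\sqrt{n}$. Then, using the asymptotic power lemma, we have

$$\gamma^* = \gamma(\sqrt{n}\theta^*/\sqrt{n}) \approx 1 - \Phi(z_\alpha - \sqrt{n}\theta^*\tau_S^{-1}).$$

Now let $z_{\gamma^*}$ denote the upper $\gamma^*$ quantile of the standard normal distribution; that is, $1 - \Phi(z_{\gamma^*}) = \gamma^*$. Then, from this last equation, we have

$$z_{\gamma^*} = z_\alpha - \sqrt{n}\theta^*\tau_S^{-1}.$$

Solving for $n$, we get

$$n_S = \left(\frac{(z_\alpha - z_{\gamma^*})\tau_S}{\theta^*}\right)^2. \tag{10.2.25}$$

As outlined in Exercise 10.2.9, for this situation the sample size determination for the test based on the sample mean is

$$n_{\bar{X}} = \left(\frac{(z_\alpha - z_{\gamma^*})\sigma}{\theta^*}\right)^2, \tag{10.2.26}$$

where $\sigma^2 = \mathrm{Var}(\varepsilon)$.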
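The two sample size formulas can be sketched in R as follows. This is a minimal illustration under assumed settings ($\alpha = 0.05$, target power 0.90, $\theta^* = 0.5$, N(0, 1) errors); the function names `asy_sample_size` and `asy_power` are hypothetical, and the sample sizes are left unrounded so that the ratio is visible.

```r
# Sample sizes from (10.2.25) and (10.2.26), plus the asymptotic power of
# Theorem 10.2.2 evaluated with delta = sqrt(n) * theta_star.
# 'scale' is tau_S for the sign test and sigma for the t-test.
asy_sample_size <- function(theta_star, scale, alpha = 0.05, power = 0.90) {
  z_alpha <- qnorm(1 - alpha)   # upper alpha quantile
  z_gamma <- qnorm(1 - power)   # satisfies 1 - Phi(z_gamma) = power
  ((z_alpha - z_gamma) * scale / theta_star)^2
}
asy_power <- function(n, theta_star, scale, alpha = 0.05) {
  1 - pnorm(qnorm(1 - alpha) - sqrt(n) * theta_star / scale)
}

sigma <- 1
tau_S <- 1 / (2 * dnorm(0, sd = sigma))   # tau_S = 1 / [2 f(0)] at the N(0, 1) pdf
n_S <- asy_sample_size(0.5, tau_S)        # sign test
n_t <- asy_sample_size(0.5, sigma)        # test based on the sample mean
print(c(n_S = n_S, n_t = n_t, ratio = n_t / n_S))
print(asy_power(n_S, 0.5, tau_S))         # recovers the target power
```

In practice one would round each sample size up to the next integer.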

Suppose we have two tests of the same level for which the asymptotic power lemma holds and for each we determine the sample size necessary to achieve power $\gamma^*$ at the alternative $\theta^*$. Then the ratio of these sample sizes is called the asymptotic relative efficiency (ARE) between the tests. We show later that this is the same as the ARE defined in Chapter 6 between estimators. Hence the ARE of the sign test to the t-test is

$$\mathrm{ARE}(S, t) = \frac{n_{\bar{X}}}{n_S} = \frac{\sigma^2}{\tau_S^2} = \frac{c_S^2}{c_t^2}. \tag{10.2.27}$$

Note that this is the same relative efficiency that was discussed in Example 6.2.5 when the sample median was compared to the sample mean. In the next two examples we revisit this discussion by examining the AREs when $X_i$ has a normal distribution and then a Laplace (double exponential) distribution.


Example 10.2.2 (ARE(S, t): normal distribution). Suppose $X_1, X_2, \ldots, X_n$ follow the location model (10.1.4), where $f(x)$ is a $N(0, \sigma^2)$ pdf. Then $\tau_S = (2f(0))^{-1} = \sigma\sqrt{\pi/2}$. Hence the ARE(S, t) is given by

$$\mathrm{ARE}(S, t) = \frac{\sigma^2}{\tau_S^2} = \frac{\sigma^2}{(\pi/2)\sigma^2} = \frac{2}{\pi} = 0.637. \tag{10.2.28}$$

Hence at the normal distribution the sign test is only 64% as efficient as the t-test. In terms of sample size at the normal distribution, the t-test requires a smaller sample, $0.64 n_s$, where $n_s$ is the sample size of the sign test, to achieve the same power as the sign test. A cautionary note is needed here because this is asymptotic efficiency. There have been ample empirical (simulation) studies that give credence to these numbers.


Example 10.2.3 (ARE(S, t) at the Laplace distribution). For this example, consider Model (10.1.4), where $f(x)$ is the Laplace pdf $f(x) = (2b)^{-1} \exp\{-|x|/b\}$ for $-\infty < x < \infty$ and $b > 0$. Then $\tau_S = (2f(0))^{-1} = b$, while $\sigma^2 = E(X^2) = 2b^2$. Hence the ARE(S, t) is given by

$$\mathrm{ARE}(S, t) = \frac{\sigma^2}{\tau_S^2} = \frac{2b^2}{b^2} = 2. \tag{10.2.29}$$

So, at the Laplace distribution, the sign test is (asymptotically) twice as efficient as the t-test. In terms of sample size at the Laplace distribution, the t-test requires twice as large a sample as the sign test to achieve the same asymptotic power as the sign test.

Recall from Example 6.3.4 that the sign test is the scores type likelihood ratio test when the true distribution is the Laplace. ■
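The two AREs above can be checked numerically. The sketch below computes $\sigma^2/\tau_S^2$ for any pdf that is symmetric about 0, obtaining the variance by numerical integration; the helper names `are_sign_t` and `laplace_pdf` and the particular scale values are assumptions made for illustration.

```r
# ARE(S, t) = sigma^2 / tau_S^2 with tau_S = 1 / [2 f(0)], for a pdf f symmetric about 0.
# The variance is computed numerically, using symmetry: 2 * integral over (0, Inf).
are_sign_t <- function(f) {
  sigma2 <- 2 * integrate(function(x) x^2 * f(x), lower = 0, upper = Inf)$value
  tau_S <- 1 / (2 * f(0))
  sigma2 / tau_S^2
}

laplace_pdf <- function(x, b = 1) exp(-abs(x) / b) / (2 * b)

are_normal  <- are_sign_t(function(x) dnorm(x, sd = 2))       # any sigma should give 2 / pi
are_laplace <- are_sign_t(function(x) laplace_pdf(x, b = 3))  # any b should give 2
print(c(normal = are_normal, laplace = are_laplace))
```

Because the ARE is a ratio of squared scale functionals, changing `sd` or `b` should leave the results unchanged up to integration error.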


The normal distribution has much lighter tails than the Laplace distribution, because the two pdfs are proportional to $\exp\{-t^2/2\sigma^2\}$ and $\exp\{-|t|/b\}$, respectively. Based on the last two examples, it seems that the t-test is more efficient for light-tailed distributions while the sign test is more efficient for heavier-tailed distributions. This is true in general and we illustrate this in the next example where we can easily vary the tail weight from light to heavy.


Example 10.2.4 (ARE(S, t) at a family of contaminated normals). Consider the location Model (10.1.4), where the cdf of $\varepsilon_i$ is the contaminated normal cdf given in expression (3.4.19). Assume that $\theta_0 = 0$. Recall that for this distribution, $(1 - \epsilon)$ proportion of the time the sample is drawn from a $N(0, b^2)$ distribution, while $\epsilon$ proportion of the time the sample is drawn from a $N(0, b^2\sigma_c^2)$ distribution. The corresponding pdf is given by

$$f(x) = \frac{1 - \epsilon}{b}\,\phi\left(\frac{x}{b}\right) + \frac{\epsilon}{b\sigma_c}\,\phi\left(\frac{x}{b\sigma_c}\right), \tag{10.2.30}$$

where $\phi(z)$ is the pdf of a standard normal random variable. As shown in Section 3.4, the variance of $\varepsilon_i$ is $b^2(1 + \epsilon(\sigma_c^2 - 1))$. Also, $\tau_S = (b\sqrt{\pi/2})/[1 - \epsilon + (\epsilon/\sigma_c)]$. Thus the ARE is

$$\mathrm{ARE}(S, t) = \frac{2}{\pi}[1 + \epsilon(\sigma_c^2 - 1)][1 - \epsilon + (\epsilon/\sigma_c)]^2. \tag{10.2.31}$$

For example, the following table (see Exercise 6.2.6) shows the AREs for various values of $\epsilon$ when $\sigma_c$ is set at 3.0:

[Table of ARE(S, t) for selected values of $\epsilon$ with $\sigma_c = 3$; entries not recoverable from the source.]

Note: if $\epsilon$ increases over the range of values in the table, then the contamination effect becomes larger (generally resulting in a heavier-tailed distribution) and as the table shows, the sign test becomes more efficient relative to the t-test. Increasing $\sigma_c$ has the same effect. It does take, however, with $\sigma_c = 3$, over 10% contamination before the sign test becomes more efficient than the t-test. ■
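Expression (10.2.31) is easy to evaluate directly. The following sketch tabulates ARE(S, t) with $\sigma_c = 3$; the function name `are_sign_t_cn` and the grid of contamination proportions are assumptions chosen for illustration, not necessarily the values used in the text's table.

```r
# ARE(S, t) at the contaminated normal, expression (10.2.31).
are_sign_t_cn <- function(eps, sigma_c) {
  (2 / pi) * (1 + eps * (sigma_c^2 - 1)) * (1 - eps + eps / sigma_c)^2
}

eps_grid <- c(0, 0.01, 0.02, 0.03, 0.05, 0.10, 0.15, 0.25)  # assumed grid of proportions
are_grid <- are_sign_t_cn(eps_grid, sigma_c = 3)
print(data.frame(eps = eps_grid, ARE_S_t = round(are_grid, 3)))
```

With $\epsilon = 0$ the formula reduces to $2/\pi$, the value at the normal distribution, and the output crosses 1 between 10% and 15% contamination.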


10.2.2 Estimating Equations Based on the Sign Test 


In practice, we often want to estimate $\theta$, the median of $X_i$, in Model (10.2.1). The associated point estimate based on the sign test can be described with a simple geometry, which is analogous to the geometry of the sample mean. As Exercise 10.2.6 shows, the sample mean $\bar{X}$ is such that

$$\bar{X} = \operatorname{Argmin}_\theta \sqrt{\sum_{i=1}^n (X_i - \theta)^2}. \tag{10.2.32}$$

The quantity $\sqrt{\sum_{i=1}^n (X_i - \theta)^2}$ is the Euclidean distance between the vector of observations $\mathbf{X} = (X_1, X_2, \ldots, X_n)'$ and the vector $\theta\mathbf{1}$. If we simply interchange the square root and the summation symbols, we go from the Euclidean distance to the $L_1$ distance. Let

$$\hat{\theta} = \operatorname{Argmin}_\theta \sum_{i=1}^n |X_i - \theta|. \tag{10.2.33}$$

To determine $\hat{\theta}$, simply differentiate the quantity on the right side with respect to $\theta$ (as in Chapter 6, define the derivative of $|x|$ to be 0 at $x = 0$). We then obtain

$$\frac{\partial}{\partial\theta} \sum_{i=1}^n |X_i - \theta| = -\sum_{i=1}^n \mathrm{sgn}(X_i - \theta).$$

Setting this to 0, we obtain the estimating equations (EE)

$$\sum_{i=1}^n \mathrm{sgn}(X_i - \theta) = 0, \tag{10.2.34}$$

whose solution is the sample median $Q_2$, (4.4.4).

Because our observations are continuous random variables, we have the identity

$$\sum_{i=1}^n \mathrm{sgn}(X_i - \theta) = 2S(\theta) - n.$$

Hence the sample median also solves $S(\theta) \doteq n/2$. Consider again Figure 10.2.2. Imagine $n/2$ on the vertical axis. This is halfway in the total drop of $S(\theta)$, from $n$ to 0. The order statistic on the horizontal axis corresponding to $n/2$ is essentially the sample median (middle order statistic). In terms of testing, this last equation says that, based on the data, the sample median is the "most acceptable" hypothesis, because $n/2$ is the null expected value of the test statistic. We often think of this as estimation by the inversion of a test.
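The geometry described above can be illustrated on a small sample. In the sketch below the five data values are hypothetical, and `l1_dist` and `S` are illustrative helper names; the code minimizes the $L_1$ distance numerically and checks the sign identity and the estimating equation.

```r
# The sample median minimizes the L1 distance and solves the estimating equation (10.2.34).
x <- c(2.1, -0.4, 3.3, 0.9, 1.7)          # hypothetical sample, n = 5 (odd)
n <- length(x)

l1_dist <- function(theta) sum(abs(x - theta))
S <- function(theta) sum(x > theta)        # sign process S(theta)

theta_hat <- optimize(l1_dist, interval = range(x))$minimum
print(c(minimizer = theta_hat, median = median(x)))

# Identity sum sgn(X_i - theta) = 2 S(theta) - n at a theta that is not a data point
theta0 <- 1.0
print(c(sum(sign(x - theta0)), 2 * S(theta0) - n))

# At the sample median the estimating equation holds exactly here (sgn(0) = 0)
print(sum(sign(x - median(x))))
```

For even n the estimating equation is solved by every point between the two middle order statistics, which is why the text writes $S(\theta) \doteq n/2$ rather than an exact equality.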

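We now sketch the asymptotic distribution of the sample median. Assume without loss of generality that the true median of $X_i$ is 0. Suppose $-\infty < x < \infty$. Using the fact that $S(\theta)$ is nonincreasing and that the sample median solves $S(\theta) \doteq n/2$, we have the following equivalences:

$$\{\sqrt{n}\,Q_2 < x\} \Leftrightarrow \left\{Q_2 < \frac{x}{\sqrt{n}}\right\} \Leftrightarrow \left\{S\left(\frac{x}{\sqrt{n}}\right) < \frac{n}{2}\right\}.$$

Hence we have

$$\begin{aligned}
P_0(\sqrt{n}\,Q_2 < x) &= P_0\left[S\left(\frac{x}{\sqrt{n}}\right) < \frac{n}{2}\right] \\
&= P_{-x/\sqrt{n}}\left[S(0) < \frac{n}{2}\right] \\
&= P_{-x/\sqrt{n}}\left[\frac{S(0) - (n/2)}{\sqrt{n}/2} < 0\right] \\
&\to \Phi(0 - (-x)\tau_S^{-1}) = P(\tau_S Z < x),
\end{aligned}$$

where $Z$ has a standard normal distribution. Notice that the limit was obtained by invoking the Asymptotic Power Lemma with $\alpha = 0.5$ and hence $z_\alpha = 0$. Rearranging the last term above, we obtain the asymptotic distribution of the sample median, which we state as a theorem: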


Theorem 10.2.3. For the random sample $X_1, X_2, \ldots, X_n$, assume that Model (10.2.1) holds. Suppose that $f(0) > 0$. Let $Q_2$ denote the sample median. Then

$$\sqrt{n}(Q_2 - \theta) \xrightarrow{D} N(0, \tau_S^2), \tag{10.2.35}$$

where $\tau_S = (2f(0))^{-1}$.
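A small Monte Carlo experiment gives some feel for this theorem. The sketch below is only an illustration under assumed settings (standard normal data, n = 101, 4000 replications, an arbitrary seed); it compares n times the simulated variance of the sample median with $\tau_S^2 = \pi/2$.

```r
# Monte Carlo check of Theorem 10.2.3 at the N(0, 1) distribution (an assumed setting):
# n * Var(Q2) should be close to tau_S^2 = pi / 2.
set.seed(1)                      # arbitrary seed for reproducibility
n <- 101
nsims <- 4000
meds <- replicate(nsims, median(rnorm(n)))
tau_S <- 1 / (2 * dnorm(0))      # tau_S = 1 / [2 f(0)]
print(c(simulated = n * var(meds), asymptotic = tau_S^2))
```

The agreement is only approximate, since the theorem is an asymptotic statement and the simulation carries Monte Carlo error.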


In Section 6.2 we defined the ARE between two estimators to be the reciprocal of the ratio of their asymptotic variances. For the sample median and mean, this is the same ratio as that based on sample size determinations of their respective tests given earlier in expression (10.2.27).


10.2.3 Confidence Interval for the Median


In Section 4.4, we obtained a confidence interval for the median. In this section, we derive this confidence interval by inverting the sign test. Based on the monotonicity of $S(\theta)$, the derivation is straightforward, but the technique will prove useful in subsequent sections of this chapter.

Suppose the random sample $X_1, X_2, \ldots, X_n$ follows the location model (10.2.1). In this subsection, we develop a confidence interval for the median $\theta$ of $X_i$. Assuming that $\theta$ is the true median, the random variable $S(\theta)$, (10.2.9), has a binomial $b(n, 1/2)$ distribution. For $0 < \alpha < 1$, select $c_1$ so that $P_\theta[S(\theta) \le c_1] = \alpha/2$. Hence we have

$$1 - \alpha = P_\theta[c_1 < S(\theta) < n - c_1]. \tag{10.2.36}$$

Recall in our derivation for the t-confidence interval for the mean in Chapter 3, we began with such a statement and then "inverted" the pivot random variable $t = \sqrt{n}(\bar{X} - \mu)/S$ ($S$ in this expression is the sample standard deviation) to obtain an equivalent inequality with $\mu$ isolated in the middle. In this case, the function $S(\theta)$ does not have an inverse, but it is a decreasing step function of $\theta$ and the inversion can still be performed. As depicted in Figure 10.2.2, $c_1 < S(\theta) < n - c_1$ if and only if $Y_{c_1+1} \le \theta < Y_{n-c_1}$, where $Y_1 < Y_2 < \cdots < Y_n$ are the order statistics of the sample $X_1, X_2, \ldots, X_n$. Therefore, the interval $[Y_{c_1+1}, Y_{n-c_1})$ is a $(1 - \alpha)100\%$ confidence interval for the median $\theta$. Because the order statistics are continuous random variables, the interval $(Y_{c_1+1}, Y_{n-c_1})$ is an equivalent confidence interval.

If $n$ is large, then there is a large sample approximation to $c_1$. We know from the Central Limit Theorem that $S(\theta)$ is approximately normal with mean $n/2$ and variance $n/4$. Then, using the continuity correction, we obtain the approximation

$$c_1 \approx \frac{n}{2} - \frac{\sqrt{n}\,z_{\alpha/2}}{2} - \frac{1}{2}, \tag{10.2.37}$$

where $\Phi(-z_{\alpha/2}) = \alpha/2$; see Exercise 10.2.7. In practice, we use the closest integer to $c_1$.


Example 10.2.5 (Example 10.2.1, Continued). There are 20 data points in the Shoshoni basket data. The sample median of the ratios of width to length is 0.5(0.628 + 0.654) = 0.641. Because $0.021 = P_{H_0}(S(0.618) \le 5)$, a 95.8% confidence interval for $\theta$ is the interval $(y_6, y_{15}) = (0.606, 0.672)$, which includes 0.618, the ratio of the width to the length, which characterizes the golden rectangle.

Currently, there is not an intrinsic R function for the one-sample sign analysis. The R function onesampsgn.R, which can be downloaded at the site listed in the Preface, computes this analysis. For these data, its default 95% confidence interval is the same as that computed above.
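The inversion of the sign test can be sketched in a few lines of R. This is not the text's onesampsgn.R function; `sign_ci` and `c1_approx` are hypothetical helpers, and the data vector is a placeholder of length 20 used only to show which order statistics are selected.

```r
# Distribution-free confidence interval for the median by inverting the sign test.
# c1 is the largest integer with P[S <= c1] <= alpha/2 for S ~ b(n, 1/2).
sign_ci <- function(x, alpha = 0.05) {
  n <- length(x)
  ks <- 0:(n - 1)
  ok <- ks[pbinom(ks, size = n, prob = 0.5) <= alpha / 2]
  if (length(ok) == 0) stop("n is too small for this confidence level")
  c1 <- max(ok)
  y <- sort(x)                               # order statistics Y_1 < ... < Y_n
  list(c1 = c1, lower = y[c1 + 1], upper = y[n - c1],
       conf = 1 - 2 * pbinom(c1, size = n, prob = 0.5))
}

# Large sample approximation (10.2.37) to c1
c1_approx <- function(n, alpha = 0.05) {
  n / 2 - sqrt(n) * qnorm(1 - alpha / 2) / 2 - 1 / 2
}

x <- rev(1:20)                               # placeholder data, n = 20
res <- sign_ci(x)
print(unlist(res))
print(round(c1_approx(20)))
```

For n = 20 the sketch selects $c_1 = 5$, so the interval is $(y_6, y_{15})$ with actual confidence of about 95.8%, in line with the example.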


EXERCISES 


10.2.1. Sketch Figure 10.2.2 for the Shoshoni basket data found in Example 10.2.1. Show the values of the test statistic, the point estimate, and the 95.8% confidence interval of Example 10.2.5 on the sketch.

10.2.2. Show that the test given by (10.2.6) has asymptotically level $\alpha$; that is, show that under $H_0$,

$$\frac{S(\theta_0) - (n/2)}{\sqrt{n}/2} \xrightarrow{D} Z,$$

where $Z$ has a $N(0, 1)$ distribution.

10.2.3. Let $\theta$ denote the median of a random variable $X$. Consider testing

$$H_0: \theta = 0 \text{ versus } H_1: \theta > 0.$$

Suppose we have a sample of size $n = 25$.

(a) Let $S(0)$ denote the sign test statistic. Determine the level of the test: reject $H_0$ if $S(0) \ge 16$.

(b) Determine the power of the test in part (a) if $X$ has $N(0.5, 1)$ distribution.

(c) Assuming $X$ has finite mean $\mu = \theta$, consider the asymptotic test of rejecting $H_0$ if $\bar{X}/(\sigma/\sqrt{n}) \ge k$. Assuming that $\sigma = 1$, determine $k$ so the asymptotic test has the same level as the test in part (a). Then determine the power of this test for the situation in part (b).


10.2.4. To appreciate the importance of setting the location functional, consider 
the length of rivers data set, as taken from Tukey (1977). This data set contains
the lengths of 141 American rivers in miles and it can be found in the file
lengthriver.rda. 


(a) Suppose the location functional is the median. Obtain the estimate and a 
95% confidence interval for it. Use the confidence interval discussed in Section 
10.2.3. Interpret it in terms of the data. Use the R function onesampsgn.R 
for computation. 


(b) Suppose the location functional is the mean. Obtain the estimate and the 
95% t-confidence interval for it. Interpret it in terms of the data. 


(c) Obtain the boxplot of the data and sketch the estimates and confidence inter- 
vals on it. Discuss. 


10.2.5. Recall the definition of a scale functional given in Exercise 10.1.4. Show that the parameter $\tau_S$ defined in Theorem 10.2.2 is a scale functional.

10.2.6. Show that the sample mean solves Equation (10.2.32).

10.2.7. Derive the approximation (10.2.37).

10.2.8. Show that the power function of the sign test is nonincreasing for the hypotheses

$$H_0: \theta = \theta_0 \text{ versus } H_1: \theta < \theta_0. \tag{10.2.38}$$

10.2.9. Let $X_1, X_2, \ldots, X_n$ be a random sample that follows the location model (10.2.1). In this exercise we want to compare the sign tests and t-test of the hypotheses (10.2.2); so we assume the random errors $\varepsilon_i$ are symmetrically distributed about 0. Let $\sigma^2 = \mathrm{Var}(\varepsilon_i)$. Hence the mean and the median are the same for this location model. Assume, also, that $\theta_0 = 0$. Consider the large sample version of the t-test, which rejects $H_0$ in favor of $H_1$ if $\bar{X}/(\sigma/\sqrt{n}) > z_\alpha$.

(a) Obtain the power function, $\gamma_t(\theta)$, of the large sample version of the t-test.

(b) Show that $\gamma_t(\theta)$ is nondecreasing in $\theta$.

(c) Show that $\gamma_t(\theta_n) \to 1 - \Phi(z_\alpha - \delta/\sigma)$, under the sequence of local alternatives (10.2.13).

(d) Based on part (c), obtain the sample size determination for the t-test to detect $\theta^*$ with approximate power $\gamma^*$.

(e) Derive the ARE(S, t) given in (10.2.27).


10.3 Signed-Rank Wilcoxon


Let $X_1, X_2, \ldots, X_n$ be a random sample that follows Model (10.2.1). Inference for $\theta$ based on the sign test is simple and requires few assumptions about the underlying distribution of $X_i$. On the other hand, sign procedures have the low efficiency of 0.64 relative to procedures based on the t-test given an underlying normal distribution. In this section, we discuss a nonparametric procedure that does attain high efficiency relative to the t-test. We make the additional assumption that the pdf $f(x)$ of $\varepsilon_i$ in Model (10.2.1) is symmetric; i.e., $f(x) = f(-x)$, for all $x$ such that $-\infty < x < \infty$. Hence $X_i$ is symmetrically distributed about $\theta$. Thus, by Theorem 10.1.1, all location parameters are identical.

First, consider the one-sided hypotheses

$$H_0: \theta = 0 \text{ versus } H_1: \theta > 0. \tag{10.3.1}$$

There is no loss of generality in assuming that the null hypothesis is $H_0: \theta = 0$, for if it were $H_0: \theta = \theta_0$, we would consider the sample $X_1 - \theta_0, \ldots, X_n - \theta_0$. Under a symmetric pdf, observations $X_i$ that are the same distance from 0 are equilikely and hence should receive the same weight. A test statistic that does this is the signed-rank Wilcoxon given by

$$T = \sum_{i=1}^n \mathrm{sgn}(X_i) R|X_i|, \tag{10.3.2}$$

where $R|X_i|$ denotes the rank of $|X_i|$ among $|X_1|, \ldots, |X_n|$, where the rankings are from low to high. Intuitively, under the null hypothesis, we expect half of the $X_i$s to be positive and half to be negative. Further, the ranks are uniformly distributed on the integers $\{1, 2, \ldots, n\}$. Hence values of $T$ around 0 are indicative of $H_0$. On the other hand, if $H_1$ is true, then we expect more than half of the $X_i$s to be positive and further, the positive observations are more likely to receive the higher ranks. Thus an appropriate decision rule is

$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } T \ge c, \tag{10.3.3}$$

where $c$ is determined by the level $\alpha$ of the test.

Given $\alpha$, we need the null distribution of $T$ to determine the critical point $c$. The set of integers $\{-n(n+1)/2, -[n(n+1)/2] + 2, \ldots, n(n+1)/2\}$ forms the support of $T$. Also, from Section 10.2, we know that the signs are iid with support $\{-1, 1\}$ and pmf

$$p(-1) = p(1) = \frac{1}{2}. \tag{10.3.4}$$

A key result is the following lemma:

Lemma 10.3.1. Under $H_0$ and symmetry about 0 for the pdf, the random variables $|X_1|, \ldots, |X_n|$ are independent of the random variables $\mathrm{sgn}(X_1), \ldots, \mathrm{sgn}(X_n)$.

Proof: Because $X_1, \ldots, X_n$ is a random sample from the cdf $F(x)$, it suffices to show that $P[|X_i| \le x, \mathrm{sgn}(X_i) = 1] = P[|X_i| \le x]P[\mathrm{sgn}(X_i) = 1]$. Due to $H_0$ and the symmetry of $f(x)$, this follows from the following string of equalities:

$$\begin{aligned}
P[|X_i| \le x, \mathrm{sgn}(X_i) = 1] &= P[0 < X_i \le x] = F(x) - \frac{1}{2} \\
&= [2F(x) - 1]\frac{1}{2} = P[|X_i| \le x]P[\mathrm{sgn}(X_i) = 1]. \quad ■
\end{aligned}$$

Based on this lemma, the ranks of the $|X_i|$s are independent of the signs of the $X_i$s. Note that the ranks are a permutation of the integers $1, 2, \ldots, n$. By the lemma, this independence is true for any permutation. In particular, suppose we use the permutation that orders the absolute values. For example, suppose the observations are $-6.1, 4.3, 7.2, 8.0, -2.1$. Then the permutation $5, 2, 1, 3, 4$ orders the absolute values; that is, the fifth observation is the smallest in absolute value, the second observation is the next smallest, etc. This permutation is called the anti-ranks, which we denote generally by $i_1, i_2, \ldots, i_n$. Using the anti-ranks, we can write $T$ as

$$T = \sum_{j=1}^n j\,\mathrm{sgn}(X_{i_j}), \tag{10.3.5}$$

where, by Lemma 10.3.1, $\mathrm{sgn}(X_{i_j})$ are iid with support $\{-1, 1\}$ and pmf (10.3.4).

Based on this observation, for $s$ such that $-\infty < s < \infty$, the mgf of $T$ is

$$\begin{aligned}
E[\exp\{sT\}] &= E\left[\exp\left\{\sum_{j=1}^n s j\,\mathrm{sgn}(X_{i_j})\right\}\right] \\
&= \prod_{j=1}^n E[\exp\{s j\,\mathrm{sgn}(X_{i_j})\}] \\
&= \prod_{j=1}^n \left(\frac{1}{2}e^{-sj} + \frac{1}{2}e^{sj}\right) \\
&= \frac{1}{2^n} \prod_{j=1}^n \left(e^{-sj} + e^{sj}\right).
\end{aligned} \tag{10.3.6}$$

Because the mgf does not depend on the underlying symmetric pdf $f(x)$, the test statistic $T$ is distribution free under $H_0$. Although the pmf of $T$ cannot be obtained in closed form, this mgf can be used to generate the pmf for a specified $n$; see Exercise 10.3.1.
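The anti-rank representation and the product form of the mgf can both be made concrete in R. The sketch below first computes $T$ from ranks and from anti-ranks for the five observations quoted above, and then builds the null pmf of $T$ by convolving the two-point distributions of the terms $j\,\mathrm{sgn}(X_{i_j})$, which is one way (among others) to turn the product in (10.3.6) into a pmf. The function name `signed_rank_pmf` is hypothetical.

```r
# (a) T from ranks and from anti-ranks, for the five observations in the text
x <- c(-6.1, 4.3, 7.2, 8.0, -2.1)
anti <- order(abs(x))                         # anti-ranks i_1, ..., i_n
T_ranks <- sum(sign(x) * rank(abs(x)))        # expression (10.3.2)
T_anti  <- sum(seq_along(x) * sign(x[anti]))  # expression (10.3.5)
print(anti)
print(c(T_ranks = T_ranks, T_anti = T_anti))

# (b) Null pmf of T: T is a sum of independent terms taking the values -j and +j
# with probability 1/2 each, so its pmf is an n-fold convolution.
signed_rank_pmf <- function(n) {
  m <- n * (n + 1) / 2
  pmf <- numeric(2 * m + 1)    # position k holds P(T = k - m - 1)
  pmf[m + 1] <- 1              # start from a point mass at 0
  for (j in seq_len(n)) {
    idx <- which(pmf > 0)
    new <- numeric(2 * m + 1)
    new[idx - j] <- new[idx - j] + 0.5 * pmf[idx]
    new[idx + j] <- new[idx + j] + 0.5 * pmf[idx]
    pmf <- new
  }
  data.frame(t = seq(-m, m), prob = pmf)
}

d <- signed_rank_pmf(5)
print(c(total = sum(d$prob), mean = sum(d$t * d$prob), var = sum(d$t^2 * d$prob)))
```

For n = 5 the printed mean and variance can be compared with 0 and $n(n+1)(2n+1)/6 = 55$.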

Because the $\mathrm{sgn}(X_{i_j})$s are mutually independent with mean zero, it follows that $E_{H_0}[T] = 0$. Further, because the variance of $\mathrm{sgn}(X_{i_j})$ is 1, we have

$$\mathrm{Var}_{H_0}(T) = \sum_{j=1}^n \mathrm{Var}_{H_0}(j\,\mathrm{sgn}(X_{i_j})) = \sum_{j=1}^n j^2 = n(n+1)(2n+1)/6.$$

We summarize these results in the following theorem:

Theorem 10.3.1. Assume that Model (10.2.1) is true for the random sample $X_1, \ldots, X_n$. Assume also that the pdf $f(x)$ is symmetric about 0. Then under $H_0$,

$T$ is distribution free with a symmetric pmf (10.3.7)

$E_{H_0}[T] = 0$ (10.3.8)

$\mathrm{Var}_{H_0}(T) = \dfrac{n(n+1)(2n+1)}{6}$ (10.3.9)

$\dfrac{T}{\sqrt{\mathrm{Var}_{H_0}(T)}}$ has an asymptotic $N(0, 1)$ distribution. (10.3.10)

Proof: The first part of (10.3.7) and the expressions (10.3.8) and (10.3.9) were derived above. The asymptotic distribution of $T$ certainly is plausible and its proof can be found in more advanced books. To obtain the second part of (10.3.7), we need to show that the distribution of $T$ is symmetric about 0. But by the mgf of $T$, (10.3.6), we have

$$E[\exp\{s(-T)\}] = E[\exp\{(-s)T\}] = E[\exp\{sT\}].$$

Hence $T$ and $-T$ have the same distribution, so $T$ is symmetrically distributed about 0. ■

Note that the support of $T$ is much denser than that of the sign test, so the normal approximation is good even for a sample size of 10.

There is another formulation of $T$ that is convenient. Let $T^+$ denote the sum of the ranks of the positive $X_i$s. Then, because the sum of all ranks is $n(n+1)/2$, we have

$$\begin{aligned}
T = \sum_{i=1}^n \mathrm{sgn}(X_i) R|X_i| &= \sum_{X_i > 0} R|X_i| - \sum_{X_i < 0} R|X_i| \\
&= 2\sum_{X_i > 0} R|X_i| - \frac{n(n+1)}{2} \\
&= 2T^+ - \frac{n(n+1)}{2}.
\end{aligned} \tag{10.3.11}$$

Hence $T^+$ is a linear function of $T$ and thus is an equivalent formulation of the signed-rank test statistic $T$. For the record, we note the null mean and variance of $T^+$:

$$E_{H_0}(T^+) = \frac{n(n+1)}{4} \text{ and } \mathrm{Var}_{H_0}(T^+) = \frac{n(n+1)(2n+1)}{24}. \tag{10.3.12}$$

The intrinsic R function wilcox.test computes the signed-rank analysis, returning the test statistic $T^+$ and the p-value. If the sample is in the R vector x, then the signed-rank test of the hypotheses (10.3.1) is computed by the R command wilcox.test(x,alt="greater"). The arguments for the other one-sided and the two-sided hypotheses are respectively alt="less" and alt="two.sided". To compute the signed-rank test of the hypotheses $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$, use the command wilcox.test(x,alt="two.sided",mu=theta0). Also, the R call psignrank(y,n) computes the cdf of $T^+$ at y.


Example 10.3.1 (Zea mays Data of Darwin). Reconsider the data set discussed in Example 4.5.1. Recall that $W_i$ is the difference in heights of the cross-fertilized plant minus the self-fertilized plant in pot $i$, for $i = 1, \ldots, 15$. Let $\theta$ be the location parameter and consider the one-sided hypotheses

$$H_0: \theta = 0 \text{ versus } H_1: \theta > 0. \tag{10.3.13}$$

Table 10.3.1 displays the data and the signed ranks.

Adding up the ranks of the positive items in column 5 of Table 10.3.1, we obtain $T^+ = 96$. Using the exact distribution (the R command is 1-psignrank(95,15)), we obtain the p-value, $p = P_{H_0}(T^+ \ge 96) = 0.021$. For comparison, the asymptotic p-value, using the continuity correction, is

$$P_{H_0}(T^+ \ge 96) = P_{H_0}(T^+ \ge 95.5) \approx P\left(Z \ge \frac{95.5 - 60}{\sqrt{15 \cdot 16 \cdot 31/24}}\right) = P(Z \ge 2.016) = 0.022,$$

which is quite close to the exact value of 0.021.

Table 10.3.1: Signed Ranks for Darwin Data, Example 10.3.1

Pot | Cross-Fertilized | Self-Fertilized | Difference | Signed-Rank
  1 |      23.500      |     17.375      |    6.125   |     11
  2 |      12.000      |     20.375      |   -8.375   |    -14
  3 |      21.000      |     20.000      |    1.000   |      2
  4 |      22.000      |     20.000      |    2.000   |      4
  5 |      19.125      |     18.375      |    0.750   |      1
  6 |      21.550      |     18.625      |    2.925   |      5
  7 |      22.125      |     18.625      |    3.500   |      7
  8 |      20.375      |     15.250      |    5.125   |      9
  9 |      18.250      |     16.500      |    1.750   |      3
 10 |      21.625      |     18.000      |    3.625   |      8
 11 |      23.250      |     16.250      |    7.000   |     12
 12 |      21.000      |     18.000      |    3.000   |      6
 13 |      22.125      |     12.750      |    9.375   |     15
 14 |      23.000      |     15.500      |    7.500   |     13
 15 |      12.000      |     18.000      |   -6.000   |    -10

Suppose the R vector ds contains the paired differences between cross and self-fertilized. Then the R command wilcox.test(ds,alt="greater") computes the value of $T^+$ along with the p-value. The computed values are the same as those computed above.
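The computations in this example can be reproduced with base R alone. The sketch below enters the differences from Table 10.3.1 by hand; the variable names other than `ds` are illustrative.

```r
# Paired differences (cross-fertilized minus self-fertilized) from Table 10.3.1
ds <- c(6.125, -8.375, 1.000, 2.000, 0.750, 2.925, 3.500, 5.125,
        1.750, 3.625, 7.000, 3.000, 9.375, 7.500, -6.000)
n <- length(ds)

Tplus <- sum(rank(abs(ds))[ds > 0])            # sum of the ranks of the positive differences
p_exact <- 1 - psignrank(Tplus - 1, n)          # exact P(T+ >= observed value)

mu <- n * (n + 1) / 4                           # null mean of T+, (10.3.12)
sdev <- sqrt(n * (n + 1) * (2 * n + 1) / 24)    # null standard deviation of T+
p_asy <- 1 - pnorm((Tplus - 0.5 - mu) / sdev)   # normal approximation, continuity corrected

fit <- wilcox.test(ds, alternative = "greater")
print(c(Tplus = Tplus, p_exact = p_exact, p_asy = p_asy, p_wilcox = fit$p.value))
```

Since there are no ties or zeros among the differences and n is small, wilcox.test uses the exact null distribution here, so its p-value should agree with `p_exact`.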


There is another formulation of $T^+$ which is useful for obtaining the properties of the Wilcoxon signed-rank test and confidence intervals for $\theta$. Let $X_i > 0$ and consider all $X_j$ such that $-X_i < X_j \le X_i$. Thus all the averages $(X_i + X_j)/2$, under these restrictions, are positive, including $(X_i + X_i)/2$. From the restriction, though, the number of these positive averages is simply the $R|X_i|$. Doing this for all $X_i > 0$, we obtain

$$T^+ = \#_{i \le j}\{(X_j + X_i)/2 > 0\}. \tag{10.3.14}$$

The pairwise averages $(X_j + X_i)/2$ are often called the Walsh averages. Hence the signed-rank Wilcoxon can be obtained by counting the number of positive Walsh averages.
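The Walsh average formulation is simple to verify on a small sample. The sketch below reuses the five observations quoted earlier in this section for the anti-ranks; the variable names are illustrative.

```r
# T+ as the number of positive Walsh averages
x <- c(-6.1, 4.3, 7.2, 8.0, -2.1)             # five observations used earlier in this section
n <- length(x)

avg <- outer(x, x, "+") / 2                    # all pairwise averages (X_i + X_j) / 2
walsh <- avg[upper.tri(avg, diag = TRUE)]      # keep the n(n + 1)/2 averages with i <= j

Tplus_ranks <- sum(rank(abs(x))[x > 0])        # sum of the ranks of the positive observations
Tplus_walsh <- sum(walsh > 0)                  # count of positive Walsh averages
T_stat <- sum(sign(x) * rank(abs(x)))          # signed-rank statistic (10.3.2)
print(c(n_walsh = length(walsh), Tplus_ranks = Tplus_ranks,
        Tplus_walsh = Tplus_walsh, T_stat = T_stat,
        from_Tplus = 2 * Tplus_ranks - n * (n + 1) / 2))
```

The two versions of $T^+$ agree, and $T$ is recovered from $T^+$ through the linear relation $T = 2T^+ - n(n+1)/2$.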

Based on the identity (10.3.14), we obtain the corresponding process. Let

$$T^+(\theta) = \#_{i \le j}\{[(X_j - \theta) + (X_i - \theta)]/2 > 0\} = \#_{i \le j}\{(X_j + X_i)/2 > \theta\}. \tag{10.3.15}$$

The process associated with $T^+(\theta)$ is much like the sign process, (10.2.9). Let $W_1 < W_2 < \cdots < W_{n(n+1)/2}$ denote the $n(n+1)/2$ ordered Walsh averages. Then a graph of $T^+(\theta)$ would appear as in Figure 10.2.2, except the ordered Walsh averages would be on the horizontal axis and the largest value on the vertical would be $n(n+1)/2$. Hence the function $T^+(\theta)$ is a decreasing step function of $\theta$, which steps down one unit at each Walsh average. This observation greatly simplifies the discussion on the properties of the signed-rank Wilcoxon.

Let $c_\alpha$ denote the critical value of a level $\alpha$ test of the hypotheses (10.3.1) based on the signed-rank test statistic $T^+$; i.e., $\alpha = P_{H_0}(T^+ \ge c_\alpha)$. Let $\gamma_{SR}(\theta) = P_\theta(T^+ \ge c_\alpha)$, for $\theta \ge \theta_0$, denote the power function of the test. The translation property, Lemma 10.2.1, holds for the signed-rank Wilcoxon. Hence, as in Theorem 10.2.1, the power function is a nondecreasing function of $\theta$. In particular, the signed-rank Wilcoxon test is an unbiased test for the one-sided hypotheses (10.3.1).


10.3.1 Asymptotic Relative Efficiency 


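We investigate the efficiency of the signed-rank Wilcoxon by first determining its efficacy. Without loss of generality, we can assume that $\theta_0 = 0$. Consider the same sequence of local alternatives discussed in the last section; i.e.,

$$H_0: \theta = 0 \text{ versus } H_{1n}: \theta_n = \frac{\delta}{\sqrt{n}}, \tag{10.3.16}$$

where $\delta > 0$. Consider the modified statistic, which is the average of $T^+(\theta)$,

$$\bar{T}^+(\theta) = \frac{2}{n(n+1)} T^+(\theta). \tag{10.3.17}$$

Then, by (10.3.12),

$$E_0[\bar{T}^+(0)] = \frac{2}{n(n+1)}\,\frac{n(n+1)}{4} = \frac{1}{2} \text{ and } \sigma_{\bar{T}^+}^2(0) = \mathrm{Var}_0[\bar{T}^+(0)] = \frac{2n+1}{6n(n+1)}. \tag{10.3.18}$$

Let $a_n = 2/[n(n+1)]$. Note that we can decompose $\bar{T}^+(\theta_n)$ into two parts as

$$\bar{T}^+(\theta_n) = a_n S(\theta_n) + a_n \sum_{i<j} I(X_i + X_j > 2\theta_n) = a_n S(\theta_n) + a_n T^*(\theta_n), \tag{10.3.19}$$

where $S(\theta)$ is the sign process (10.2.9) and

$$T^*(\theta_n) = \sum_{i<j} I(X_i + X_j > 2\theta_n). \tag{10.3.20}$$

To obtain the efficacy, we require the mean

$$\mu_{\bar{T}^+}(\theta_n) = E_{\theta_n}[\bar{T}^+(0)] = E_0[\bar{T}^+(-\theta_n)]. \tag{10.3.21}$$

But, as in the derivation of (10.2.14), $E_0[S(-\theta_n)] = n[1 - F(-\theta_n)]$, so $a_n E_0[S(-\theta_n)] = 2[1 - F(-\theta_n)]/(n+1) \to 0$. Hence we need only be concerned with the second term in (10.3.19). But note that the Walsh averages in $T^*(\theta)$ are identically distributed. Thus

$$a_n E_0[T^*(-\theta_n)] = a_n \binom{n}{2} P_0(X_1 + X_2 > -2\theta_n). \tag{10.3.22}$$

This latter probability can be expressed as follows:

$$\begin{aligned}
P_0(X_1 + X_2 > -2\theta_n) &= E_0[P_0(X_1 > -2\theta_n - X_2 \mid X_2)] = E_0[1 - F(-2\theta_n - X_2)] \\
&= \int_{-\infty}^{\infty} [1 - F(-2\theta_n - x)] f(x)\,dx \\
&= \int_{-\infty}^{\infty} F(2\theta_n + x) f(x)\,dx \\
&\approx \int_{-\infty}^{\infty} [F(x) + 2\theta_n f(x)] f(x)\,dx \\
&= \frac{1}{2} + 2\theta_n \int_{-\infty}^{\infty} f^2(x)\,dx,
\end{aligned} \tag{10.3.23}$$

where we have used the facts that $X_1$ and $X_2$ are iid and symmetrically distributed about 0, and the mean value theorem. Hence

$$\mu_{\bar{T}^+}(\theta_n) \approx a_n \binom{n}{2} \left(\frac{1}{2} + 2\theta_n \int_{-\infty}^{\infty} f^2(x)\,dx\right). \tag{10.3.24}$$

Putting (10.3.18) and (10.3.24) together, we have the efficacy

$$c_{T^+} = \lim_{n \to \infty} \frac{\mu_{\bar{T}^+}'(0)}{\sqrt{n}\,\sigma_{\bar{T}^+}(0)} = \sqrt{12} \int_{-\infty}^{\infty} f^2(x)\,dx. \tag{10.3.25}$$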

In a more advanced text, this development can be made into a rigorous argument for the following asymptotic power lemma.

Theorem 10.3.2 (Asymptotic Power Lemma). Consider the sequence of hypotheses (10.3.16). The limit of the power function of the large sample, size $\alpha$, signed-rank Wilcoxon test is given by

$$\lim_{n \to \infty} \gamma_{SR}(\theta_n) = 1 - \Phi(z_\alpha - \delta \tau_W^{-1}), \tag{10.3.26}$$

where $\tau_W = 1/[\sqrt{12} \int_{-\infty}^{\infty} f^2(x)\,dx]$ is the reciprocal of the efficacy $c_{T^+}$ and $\Phi(z)$ is the cdf of a standard normal random variable.

As shown in Exercise 10.3.10, the parameter $\tau_W$ is a scale functional.

The arguments used in the determination of the sample size in Section 10.2 for the sign test were based on the asymptotic power lemma; hence, these arguments follow almost verbatim for the signed-rank Wilcoxon. In particular, the sample size needed so that a level $\alpha$ signed-rank Wilcoxon test of the hypotheses (10.3.1) can detect the alternative $\theta = \theta_0 + \theta^*$ with approximate probability $\gamma^*$ is

$$n_W = \left(\frac{(z_\alpha - z_{\gamma^*})\tau_W}{\theta^*}\right)^2. \tag{10.3.27}$$

Using (10.2.26), the ARE between the signed-rank Wilcoxon test and the t-test based on the sample mean is

$$\mathrm{ARE}(T, t) = \frac{n_{\bar{X}}}{n_W} = \frac{\sigma^2}{\tau_W^2}. \tag{10.3.28}$$

We now derive some AREs between the Wilcoxon and the t-test. As noted above, the parameter $\tau_W$ is a scale functional and, hence, varies directly with scale transformations of the form $aX$, for $a > 0$. Likewise, the standard deviation $\sigma$ is also a scale functional. Therefore, because the AREs are ratios of scale functionals, they are scale invariant. Hence, for derivations of AREs, we can select a pdf with a convenient choice of scale. For example, if we are considering an ARE at the normal distribution, we can work with the $N(0, 1)$ pdf.


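Example 10.3.2 (ARE(W,t) at the normal distribution). If $f(x)$ is a $N(0,1)$ pdf, 
then 

$$\tau_W^{-1} = \sqrt{12}\int_{-\infty}^{\infty}\left(\frac{1}{\sqrt{2\pi}}\exp\{-x^2/2\}\right)^2 dx = \frac{\sqrt{12}}{\sqrt{2}}\,\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}\,(1/\sqrt{2})}\exp\left\{-\frac{1}{2}\left(\frac{x}{1/\sqrt{2}}\right)^2\right\} dx = \sqrt{\frac{3}{\pi}}.$$

Hence $\tau_W^2 = \pi/3$. Since $\sigma = 1$, we have 

$$ARE(W,t) = \frac{\sigma^2}{\tau_W^2} = \frac{3}{\pi} = 0.955. \tag{10.3.29}$$

As discussed above, this ARE holds for all normal distributions. Hence, at the 
normal distribution, the Wilcoxon signed-rank test is 95.5% as efficient as the t-test. 
The Wilcoxon is called a highly efficient procedure. ■ 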
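
The value in Example 10.3.2 can be checked numerically. The following is a minimal R sketch, assuming only base R; the object names (int_f2, tau_w, are_wt) are illustrative and do not come from the text.

```r
# Numerical check of ARE(W, t) at the N(0,1) distribution (illustrative sketch)
f <- function(x) dnorm(x)                                      # N(0,1) pdf
int_f2 <- integrate(function(x) f(x)^2, -Inf, Inf)$value       # integral of f^2
tau_w  <- 1 / (sqrt(12) * int_f2)                              # scale functional tau_W
sigma2 <- integrate(function(x) x^2 * f(x), -Inf, Inf)$value   # variance (equals 1)
are_wt <- sigma2 / tau_w^2                                     # ratio sigma^2 / tau_W^2
round(c(tau_w_squared = tau_w^2, ARE = are_wt), 3)
```

Replacing f by another symmetric pdf (and keeping the same two integrals) gives the ARE at that distribution, since the ratio is scale invariant.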


Example 10.3.3 (ARE(W,t) at a Family of Contaminated Normals). For this 
example, suppose that $f(x)$ is the pdf of a contaminated normal distribution. For 
convenience, we use the standardized pdf given in expression (10.2.30) with $b = 1$. 
Recall that for this distribution, $(1-\epsilon)$ proportion of the time the sample is drawn 
from a $N(0,1)$ distribution, while $\epsilon$ proportion of the time the sample is drawn from 
a $N(0,\sigma_c^2)$ distribution. Recall that the variance is $\sigma^2 = 1 + \epsilon(\sigma_c^2 - 1)$. Note that 
the formula for the pdf $f(x)$ is given in expression (3.4.17). In Exercise 10.3.5 it is 
shown that, for $\sigma_c = 3$, 

$$\int_{-\infty}^{\infty} f^2(x)\,dx = \frac{(1-\epsilon)^2}{2\sqrt{\pi}} + \frac{\epsilon^2}{6\sqrt{\pi}} + \frac{\epsilon(1-\epsilon)}{\sqrt{5}\sqrt{\pi}}. \tag{10.3.30}$$

Based on this, an expression for the ARE can be obtained; see Exercise 10.3.5. We 
used this expression to determine the AREs between the Wilcoxon and the t-tests 
for the situations with $\sigma_c = 3$ and $\epsilon$ varying from 0.00 to 0.25, displaying them in 
Table 10.3.2. For convenience, we have also displayed the AREs between the sign 
test and these two tests. 
Note that the signed-rank Wilcoxon is more efficient than the t-test even at 1% 
contamination and increases to 150% efficiency for 15% contamination. ■ 
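
The AREs of this example can also be approximated by numerical integration rather than through a closed-form expression. The sketch below assumes the mixture form described above, $(1-\epsilon)N(0,1) + \epsilon N(0,\sigma_c^2)$; the function names dcn and are_wt_cn are hypothetical helpers, not functions from the text or its package.

```r
# Contaminated normal pdf: (1 - eps) N(0,1) + eps N(0, sigmac^2)  (assumed form)
dcn <- function(x, eps, sigmac) {
  (1 - eps) * dnorm(x) + eps * dnorm(x, sd = sigmac)
}

# ARE(W, t) = sigma^2 / tau_W^2, with tau_W^{-1} = sqrt(12) * integral of f^2
are_wt_cn <- function(eps, sigmac) {
  int_f2   <- integrate(function(x) dcn(x, eps, sigmac)^2, -Inf, Inf)$value
  tau_w_sq <- 1 / (12 * int_f2^2)
  sigma_sq <- 1 + eps * (sigmac^2 - 1)     # variance of the contaminated normal
  sigma_sq / tau_w_sq
}

eps_grid <- c(0, 0.01, 0.15, 0.25)         # a few contamination rates
round(sapply(eps_grid, are_wt_cn, sigmac = 3), 3)
```

At $\epsilon = 0$ the function reduces to the normal case of Example 10.3.2, which gives a quick sanity check.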


10.3.2 Estimating Equations Based on Signed-Rank Wilcoxon 


Table 10.3.2: AREs among the Sign, the Signed-Rank Wilcoxon, and the t-Tests 
for Contaminated Normals with $\sigma_c = 3$ and Proportion of Contamination $\epsilon$ 
[Table entries not recovered.] 

For the sign procedure, the estimation of $\theta$ was based on minimizing the $L_1$ norm. 
The estimator associated with the signed-rank test minimizes another norm, which 
is discussed in Exercises 10.3.7 and 10.3.8. Recall that we also showed that the location 
estimator based on the sign test could be obtained by inverting the test. Considering 
this for the Wilcoxon, the estimator $\hat{\theta}_W$ solves 


$$T^+(\hat{\theta}_W) = \frac{n(n+1)}{4}. \tag{10.3.31}$$

Using the description of the function $T^+(\theta)$ after its definition, (10.3.15), it 
is easily seen that $\hat{\theta}_W = \mathrm{median}_{i \le j}\{(X_i + X_j)/2\}$; i.e., the median of the Walsh 
averages. This is often called the Hodges-Lehmann estimator because of several 
seminal articles by Hodges and Lehmann on the properties of this estimator; see 
Hodges and Lehmann (1963). 

The R function wilcox.test computes the Hodges-Lehmann estimate. To illustrate 
its computation, consider the Darwin data in Example 10.3.1. Let the R 
vector ds contain the paired differences, Cross - Self. The R code segment given 
by wilcox.test(ds,conf.int=T) then computes the Hodges-Lehmann estimate 
to be 3.1375. So we estimate the difference in heights to be 3.1375 inches. 
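
To make the definition concrete, the estimate can also be computed directly as the median of the Walsh averages. This is a minimal sketch on a small made-up vector (not the Darwin data); walsh_averages and hl_estimate are hypothetical helper names.

```r
# All n(n+1)/2 Walsh averages (x_i + x_j)/2 with i <= j
walsh_averages <- function(x) {
  s <- outer(x, x, "+") / 2
  s[upper.tri(s, diag = TRUE)]
}

# Hodges-Lehmann estimate: median of the Walsh averages
hl_estimate <- function(x) median(walsh_averages(x))

x <- c(1.2, -0.4, 3.1, 0.7, 2.5)     # toy data, for illustration only
hl_estimate(x)                        # 1.35 for this toy vector

# For small samples without ties or zeros, wilcox.test reports the same value
unname(suppressWarnings(wilcox.test(x, conf.int = TRUE))$estimate)
```

For samples with ties, or with 50 or more observations, wilcox.test switches to a numerical root search, so its reported estimate may differ slightly from the exact median of the Walsh averages.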

Once again, we can use practically the same argument that we used for the sign 
process to obtain the asymptotic distribution of the Hodges-Lehmann estimator. 
We summarize the result in the next theorem. 


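Theorem 10.3.3. Consider a random sample $X_1, X_2, X_3, \ldots, X_n$ which follows 
Model (10.2.1). Suppose that $f(x)$ is symmetric about 0. Then 

$$\sqrt{n}(\hat{\theta}_W - \theta) \stackrel{D}{\to} N(0, \tau_W^2), \tag{10.3.32}$$

where $\tau_W = \left(\sqrt{12}\int_{-\infty}^{\infty} f^2(x)\,dx\right)^{-1}$. 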


Using this theorem, the AREs based on asymptotic variances for the signed-rank 
Wilcoxon are the same as those defined above. 


10.3.3 Confidence Interval for the Median 


Because of the similarity between the processes $S(\theta)$ and $T^+(\theta)$, confidence intervals 
for $\theta$ based on the signed-rank Wilcoxon follow the same way as do those based on 
$S(\theta)$. For a given level $\alpha$, let $c_{W1}$, an integer, denote the critical point of the signed-rank 
Wilcoxon distribution such that $P_\theta[T^+(\theta) \le c_{W1}] = \alpha/2$. As in Section 10.2.3, 
we then have that 

$$\begin{aligned} 1 - \alpha &= P_\theta[c_{W1} < T^+(\theta) < m - c_{W1}] \\ &= P_\theta[W_{c_{W1}+1} \le \theta < W_{m - c_{W1}}], \end{aligned} \tag{10.3.33}$$

where $m = n(n+1)/2$ denotes the number of Walsh averages. Therefore, the interval 
$[W_{c_{W1}+1}, W_{m-c_{W1}})$ is a $(1-\alpha)100\%$ confidence interval for $\theta$. 

We can use the asymptotic null distribution of $T^+$, (10.3.10), to obtain the 
following approximation to $c_{W1}$. As shown in Exercise 10.3.6, 

$$c_{W1} \approx \frac{n(n+1)}{4} - z_{\alpha/2}\sqrt{\frac{n(n+1)(2n+1)}{24}} - \frac{1}{2}, \tag{10.3.34}$$

where $\Phi(-z_{\alpha/2}) = \alpha/2$. In practice, we use the closest integer to $c_{W1}$. 

In R, this confidence interval is computed by the R function wilcox.test. For 
instance, for the Darwin data let the R vector ds contain the paired differences, 
Cross - Self. Then the call wilcox.test(ds,conf.int=T,conf.level=.95) computes 
a 95% confidence interval for the median of the differences. Its computation 
results in the interval (0.5000, 5.2125). Hence, with confidence 95%, we estimate 
that cross-fertilized zea mays are between 0.5 and 5.2 inches taller than self-fertilized 
ones. 
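
The interval (10.3.33) with the approximate cutoff (10.3.34) can be sketched directly from the ordered Walsh averages. The function name wilcoxon_ci is hypothetical, and the data below are simulated for illustration rather than taken from the text.

```r
# Confidence interval [W_(c+1), W_(m-c)) from the ordered Walsh averages,
# with the cutoff c taken from the normal approximation (10.3.34)
wilcoxon_ci <- function(x, conf.level = 0.95) {
  n <- length(x)
  w <- outer(x, x, "+") / 2
  w <- sort(w[upper.tri(w, diag = TRUE)])        # W_1 <= ... <= W_m
  m <- length(w)                                  # m = n(n+1)/2
  z <- qnorm(1 - (1 - conf.level) / 2)            # z_{alpha/2}
  cw <- round(n * (n + 1) / 4 - z * sqrt(n * (n + 1) * (2 * n + 1) / 24) - 0.5)
  cw <- max(cw, 0)                                # guard for very small n
  c(lower = w[cw + 1], upper = w[m - cw], cutoff = cw)
}

set.seed(1)
x <- rnorm(15, mean = 1)      # simulated sample, illustration only
res <- wilcoxon_ci(x)
res
```

Because wilcox.test uses the exact null distribution when it can, its endpoints may sit one order statistic away from those produced by the approximation.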


10.3.4 Monte Carlo Investigation 


The AREs derived in this chapter are asymptotic. In this section, we describe 
Monte Carlo techniques which investigate the relative efficiency between estimators 
for finite sample sizes. Comparisons are performed over families of distributions 
and a selection of sample sizes. Each combination of distribution and sample size 
is referred to as a situation. We also select a simulation size $n_s$, which is usually 
quite large. We next describe a typical simulation to investigate the relative efficiency 
between two estimators. 
For notation, let $X_1, \ldots, X_n$ be a random sample that follows the location model, 
(10.2.1), i.e., 

$$X_i = \theta + e_i, \quad i = 1, \ldots, n, \tag{10.3.35}$$


where the $e_i$'s are iid with pdf $f(x)$ and $f(x)$ is symmetric about 0. For our discussion, 
consider the case of two location estimators of $\theta$, which we denote by $\hat{\theta}_1$ and $\hat{\theta}_2$. 
Since these are location estimators, we further assume without loss of generality 
that the true $\theta = 0$. 

Let $n$ denote the sample size and let $f(x)$ denote the pdf for a given situation. 
Then $n_s$ independent random samples of size $n$ are generated from $f(x)$. For the 
$i$th sample, denote the estimates by $\hat{\theta}_{i1}$ and $\hat{\theta}_{i2}$, $i = 1, \ldots, n_s$. For the estimator $\hat{\theta}_j$, 
consider the mean square error over the simulations given by 

$$MSE_j = \frac{1}{n_s}\sum_{i=1}^{n_s} \hat{\theta}_{ij}^2, \quad j = 1, 2. \tag{10.3.36}$$

As sketched in Exercise 10.3.2, under the assumptions of symmetry and location 
estimators, $MSE_j$ is a consistent estimator of the variance of $\hat{\theta}_j$ for a sample of size 
$n$. Hence, the estimate of the relative efficiency ($RE_n$) between the estimators $\hat{\theta}_1$ 
and $\hat{\theta}_2$ at sample size $n$ is the ratio 

$$RE_n(\hat{\theta}_1, \hat{\theta}_2) = \frac{MSE_2}{MSE_1}. \tag{10.3.37}$$

To illustrate this discussion, consider a study comparing the Hodges-Lehmann 
and sample mean estimators over the family of contaminated normal distributions 
with rate of contamination $\epsilon$ and the standard deviation ratio $\sigma_c$, where we are using 
the notation of Example 10.3.3. The R function rcn.R is used to generate samples 
from a contaminated normal. The following R function aresimcn.R computes the 
simulation and returns the estimate of $RE_n$: 

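aresimcn <- function(n,nsims,eps,vc){ 
   # n = sample size, nsims = number of simulations, 
   # eps = contamination rate, vc = standard deviation ratio 
   chl <- c(); cxbar <- c() 
   for(i in 1:nsims){ 
      x <- rcn(n,eps,vc)                                     # contaminated normal sample 
      chl <- c(chl,wilcox.test(x,conf.int=TRUE)$estimate)    # Hodges-Lehmann estimate 
      cxbar <- c(cxbar,t.test(x,conf.int=TRUE)$estimate)     # sample mean 
   } 
   aresimcn <- mses(cxbar,0)/mses(chl,0)                     # ratio of MSEs, (10.3.37) 
   return(aresimcn) } 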
The function mses.R computes the MSEs, (10.3.36). All three functions are at the 
site listed in the Preface. 
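
The simulation can be tried without downloading the helper functions. The sketch below uses stand-ins for rcn and mses whose bodies are assumptions about their behavior (a Bernoulli mixture of normals, and the mean squared deviation from a given value); the names hl and re_sim are hypothetical, and the Hodges-Lehmann estimate is computed directly as the median of the Walsh averages instead of through wilcox.test.

```r
# Assumed stand-in for rcn: contaminated normal with rate eps and sd ratio sigmac
rcn <- function(n, eps, sigmac) {
  contaminated <- rbinom(n, 1, eps)              # 1 with probability eps
  z <- rnorm(n)
  z * (1 - contaminated) + sigmac * z * contaminated
}

# Assumed stand-in for mses: mean squared error about the true value theta
mses <- function(est, theta) mean((est - theta)^2)

# Hodges-Lehmann estimate as the median of the Walsh averages
hl <- function(x) {
  w <- outer(x, x, "+") / 2
  median(w[upper.tri(w, diag = TRUE)])
}

# Estimated relative efficiency MSE(sample mean) / MSE(Hodges-Lehmann)
re_sim <- function(n, nsims, eps, sigmac) {
  est <- replicate(nsims, { x <- rcn(n, eps, sigmac); c(hl(x), mean(x)) })
  mses(est[2, ], 0) / mses(est[1, ], 0)
}

set.seed(20)
re_hat <- re_sim(n = 30, nsims = 2000, eps = 0.25, sigmac = 3)
re_hat
```

With only 2000 simulations the estimate is noisier than a run of 10,000, so it should be read as a rough check rather than a replication.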

For a specific situation, set $n = 30$ with samples generated from the contaminated 
normal distribution with rate of contamination $\epsilon = 0.25$ and the standard 
deviation ratio $\sigma_c = 3$. From Table 10.3.2, the ARE is 1.616. Our run 
of the function aresimcn.R using 10,000 simulations at these settings produced the 
estimate 1.561 for the relative efficiency at sample size $n = 30$. This is close to the 
asymptotic value. The actual call was aresimcn(30,10000,.25,3). We also ran 
the situation with $\epsilon = 0.20$ and $\sigma_c = 25$. In this case, the estimated RE for samples 
of size $n = 30$ was 40.934; i.e., we estimate that the Hodges-Lehmann estimator is 
about 41 times as efficient as the sample mean at this contaminated normal distribution 
for a sample size of 30. 


EXERCISES 


10.3.1. (a) For n = 3, expand the mgf (10.3.6) to show that the distribution of 
the signed-rank Wilcoxon is given by 


(b) Obtain the distribution of the signed-rank Wilcoxon for $n = 4$. 


10.3.2. Consider the location Model (10.3.35). Assume that the pdf of the random 
errors, $f(x)$, is symmetric about 0. Let $\hat{\theta}$ be a location estimator of $\theta$. Assume that 
$E(\hat{\theta}^4)$ exists. 


(a) Show that $\hat{\theta}$ is an unbiased estimator of $\theta$. 

Hint: Assume without loss of generality that $\theta = 0$; start with $E(\hat{\theta}) = E[\hat{\theta}(X_1, \ldots, X_n)]$; 
and use the fact that $X_i$ is symmetrically distributed about 0. 


(b) As in Section 10.3.4, suppose we generate $n_s$ independent samples of size $n$ 
from the pdf $f(x)$ which is symmetric about 0. For the $i$th sample, let $\hat{\theta}_i$ be 
the estimate of $\theta$. Show that $n_s^{-1}\sum_{i=1}^{n_s} \hat{\theta}_i^2 \to V(\hat{\theta})$, in probability. 


10.3.3. Modify the code of the R function aresimcn.R so it samples from the 
$N(0,1)$ distribution. Estimate the RE between the Hodges-Lehmann estimator 
and $\bar{X}$ for the sample sizes $n = 15, 25, 50$, and 100. Use 10,000 simulations for each 
sample size. Compare your results to the asymptotic ARE which is 0.955. 


10.3.4. Consider the self rival data presented in Exercise 4.6.5. Recall that it is 
a paired design consisting of the pairs $(\text{Self}_i, \text{Rival}_i)$, for $i = 1, \ldots, 20$, where $\text{Self}_i$ 
and $\text{Rival}_i$ are the running times for circling the bases for the respective treatments 
of Self motivation and Rival motivation. The data can be found in the file 
selfrival.rda. Let $X_i = \text{Self}_i - \text{Rival}_i$ denote the paired differences and model 
these in the location model as $X_i = \theta + e_i$. Consider the hypotheses $H_0: \theta = 0$ 
versus $H_1: \theta \ne 0$. 


(a) Obtain the signed-rank test statistic and p-value for these hypotheses. State 
the conclusion (in terms of the data) using the level 0.05. 


(b) Obtain the t-test statistic and p-value and conclude using the level 0.05. 


(c) To see the effect that an outlier has on these two analyses, change the 20th 
rival time from 17.88 to 178.8. Comment on how the analyses changed due to 
the outlier. 


(d) Obtain 95% confidence intervals for $\theta$ for both analyses for the original data 
and the changed data. Comment on how the confidence intervals changed due 
to the outlier. 


10.3.5. Assume that $f(x)$ has the contaminated normal pdf given in expression 
(3.4.17). Derive expression (10.3.30) and use it to obtain ARE(W,t) for this pdf. 


10.3.6. Use the asymptotic null distribution of $T^+$, (10.3.10), to obtain the approximation 
(10.3.34) to $c_{W1}$. 


10.3.7. For a vector $v \in R^n$, define the function 

$$\|v\| = \sum_{i=1}^{n} R(|v_i|)\,|v_i|. \tag{10.3.38}$$

Show that this function is a norm on $R^n$; that is, it satisfies the properties 

1. $\|v\| \ge 0$ and $\|v\| = 0$ if and only if $v = 0$. 
2. $\|av\| = |a|\,\|v\|$, for all $a$ such that $-\infty < a < \infty$. 
3. $\|u + v\| \le \|u\| + \|v\|$, for all $u, v \in R^n$. 


For the triangle inequality, use the anti-rank version, that is, 

$$\|v\| = \sum_{j=1}^{n} j\,|v_{i_j}|. \tag{10.3.39}$$

Then use the following fact: If we have two sets of $n$ numbers, for example, 
$\{t_1, t_2, \ldots, t_n\}$ and $\{s_1, s_2, \ldots, s_n\}$, then the largest sum of pairwise products, one 
from each set, is given by $\sum_{j=1}^{n} t_{i_j} s_{k_j}$, where $\{i_j\}$ and $\{k_j\}$ are the anti-ranks for 
the $t_i$ and $s_i$, respectively, i.e., $t_{i_1} \le t_{i_2} \le \cdots \le t_{i_n}$ and $s_{k_1} \le s_{k_2} \le \cdots \le s_{k_n}$. 


10.3.8. Consider the norm given in Exercise 10.3.7. For a location model, define 
the estimate of $\theta$ to be 

$$\hat{\theta} = \mathrm{Argmin}_{\theta}\,\|X_i - \theta\|. \tag{10.3.40}$$


Show that $\hat{\theta}$ is the Hodges-Lehmann estimate, i.e., satisfies (10.3.31). 
Hint: Use the anti-rank version (10.3.39) of the norm when differentiating with 
respect to $\theta$. 


10.3.9. Prove that a pdf (or pmf) $f(x)$ is symmetric about 0 if and only if its mgf 
is symmetric about 0, provided the mgf exists. 


10.3.10. In Exercise 10.1.4, we defined the term scale functional. Show that the 
parameter $\tau_W$, (10.3.26), is a scale functional. 


10.4 Mann-Whitney-Wilcoxon Procedure 


Suppose $X_1, X_2, \ldots, X_{n_1}$ is a random sample from a distribution with a continuous 
cdf $F(x)$ and pdf $f(x)$ and $Y_1, Y_2, \ldots, Y_{n_2}$ is a random sample from a distribution 
with a continuous cdf $G(x)$ and pdf $g(x)$. For this situation there is a natural null 
hypothesis given by $H_0: F(x) = G(x)$ for all $x$; i.e., the samples are from the same 
distribution. What about alternative hypotheses besides the general alternative 
not $H_0$? An interesting alternative is that $X$ is stochastically larger than $Y$, 
which is defined by $G(x) \ge F(x)$, for all $x$, with strict inequality for at least one $x$. 
This alternative hypothesis is discussed in the exercises. 

For the most part in this section, however, we consider the location model. In 
this case, $G(x) = F(x - \Delta)$ for some value of $\Delta$. Hence the null hypothesis becomes 
$H_0: \Delta = 0$. The parameter $\Delta$ is often called the shift between the distributions 
and the distribution of $Y$ is the same as the distribution of $X + \Delta$; that is, 

$$P(Y \le y) = P(X + \Delta \le y) = F(y - \Delta). \tag{10.4.1}$$

If $\Delta > 0$, then $Y$ is stochastically larger than $X$; see Exercise 10.4.8. 

In the shift case, the parameter $\Delta$ is independent of what location functional 
is used. To see this, suppose we select an arbitrary location functional for $X$, say, 
$T(F_X)$. Then we can write $X_i$ as 

$$X_i = T(F_X) + \varepsilon_i, \tag{10.4.2}$$

where $\varepsilon_1, \ldots, \varepsilon_{n_1}$ are iid with $T(F_\varepsilon) = 0$. By (10.4.1) it follows that 

$$Y_j = T(F_X) + \Delta + \varepsilon_j, \quad j = 1, 2, \ldots, n_2. \tag{10.4.3}$$

Hence $T(F_Y) = T(F_X) + \Delta$. Therefore, $\Delta = T(F_Y) - T(F_X)$ for any location 
functional; i.e., $\Delta$ is the same no matter what functional is chosen to model location. 

Assume then that the shift model, (10.4.1), holds for the two samples. Alternatives 
of interest are the usual one- and two-sided alternatives. For convenience we 
focus on the one-sided hypotheses given by 

$$H_0: \Delta = 0 \text{ versus } H_1: \Delta > 0. \tag{10.4.4}$$


The exercises consider the other hypotheses. Under $H_0$, the distributions of $X$ 
and $Y$ are the same, and we can combine the samples to have one large sample of 
$n = n_1 + n_2$ observations. Suppose we rank the combined samples from 1 to $n$ and 
consider the statistic 

$$W = \sum_{j=1}^{n_2} R(Y_j), \tag{10.4.5}$$

where $R(Y_j)$ denotes the rank of $Y_j$ in the combined sample of $n$ items. This 
statistic is often called the Mann-Whitney-Wilcoxon (MWW) statistic. Under 
$H_0$ the ranks are uniformly distributed between the $X_i$s and the $Y_j$s; however, under 
$H_1: \Delta > 0$, the $Y_j$s should get most of the large ranks. Hence an intuitive rejection 
rule is given by 

$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } W \ge c. \tag{10.4.6}$$


We now discuss the null distribution of $W$, which enables us to select $c$ for 
the decision rule based on a specified level $\alpha$. Under $H_0$, the ranks of the $Y_j$s are 
equilikely to be any subset of size $n_2$ from a set of $n$ elements. Recall that there are 
$\binom{n}{n_2}$ such subsets; therefore, if $\{r_1, \ldots, r_{n_2}\}$ is a subset of size $n_2$ from $\{1, \ldots, n\}$, 
then 

$$P[R(Y_1) = r_1, \ldots, R(Y_{n_2}) = r_{n_2}] = \binom{n}{n_2}^{-1}. \tag{10.4.7}$$

This implies that the statistic $W$ is distribution free under $H_0$. Although the null 
distribution of $W$ cannot be obtained in closed form, there are recursive algorithms 
which obtain this distribution; see Chapter 2 of the text by Hettmansperger and 
McKean (2011). In the same way, a single rank $R(Y_j)$ is uniformly 
distributed on the integers $\{1, \ldots, n\}$, under $H_0$. Hence we immediately have 

$$E_{H_0}(W) = \sum_{j=1}^{n_2} E_{H_0}(R(Y_j)) = \sum_{j=1}^{n_2}\sum_{i=1}^{n} i\,\frac{1}{n} = \frac{n_2\,n(n+1)}{2n} = \frac{n_2(n+1)}{2}.$$

The variance is displayed below (10.4.10) and a derivation of a more general case is 
given in Section 10.5. It also can be shown that $W$ is asymptotically normal. We 
summarize these items in the theorem below. 


Theorem 10.4.1. Suppose $X_1, X_2, \ldots, X_{n_1}$ is a random sample from a distribution 
with a continuous cdf $F(x)$ and $Y_1, Y_2, \ldots, Y_{n_2}$ is a random sample from a distribution 
with a continuous cdf $G(x)$. Suppose $H_0: F(x) = G(x)$, for all $x$. If $H_0$ is 
true, then 

$$W \text{ is distribution free with a symmetric pmf} \tag{10.4.8}$$
$$E_{H_0}[W] = \frac{n_2(n+1)}{2} \tag{10.4.9}$$
$$\mathrm{Var}_{H_0}(W) = \frac{n_1 n_2 (n+1)}{12} \tag{10.4.10}$$
$$\frac{W - n_2(n+1)/2}{\sqrt{\mathrm{Var}_{H_0}(W)}} \text{ has an asymptotically } N(0,1) \text{ distribution.} \tag{10.4.11}$$


The only item of the theorem not discussed above is the symmetry of the null 
distribution, which we show later. First, consider this example: 


Example 10.4.1 (Water Wheel Data Set). In an experiment discussed in Abebe 
et al. (2001), mice were placed in a wheel that is partially submerged in water. If 
they keep the wheel moving, they avoid the water. The response is the number of 
wheel revolutions per minute. Group 1 is a placebo group, while Group 2 consists 
of mice that are under the influence of a drug. The data are 

Group 1 (X):  2.3  0.3  5.2  3.1  1.1  0.9  2.0  0.7  1.4  0.3 
Group 2 (Y):  0.8  2.8  4.0  2.4  1.2  0.0  6.2  1.5  28.8  0.7 

The data are in the file waterwheel.rda. Comparison boxplots of the data (asked 
for in Exercise 10.4.9) show that the two data sets are similar except for the large 
outlier in the treatment group. A two-sided hypothesis seems appropriate in this 
case. Notice that a few of the data points in the data set have the same value 
(are tied). This happens in real data sets. We follow the usual practice and use 
the average of the ranks involved to break ties. For example, the observations 
$x_2 = x_{10} = 0.3$ are tied and the ranks involved for the combined data are 2 and 
3. Hence we use 2.5 for the ranks of each of these observations. Continuing in 
this way, the Wilcoxon test statistic is $w = \sum_{j=1}^{10} R(y_j) = 116.50$. The null mean 
and variance of $W$ are 105 and 175, respectively. The asymptotic test statistic 
is $z = (116.5 - 105)/\sqrt{175} = 0.869$ with p-value 2*(1-pnorm(0.869)) = 0.3848. 
Hence $H_0$ would not be rejected. The test confirms the comparison boxplots of the 
data. The t-test based on the difference in means is discussed in Exercise 10.4.9. In 
Example 10.4.2, we discuss the R computation. ■ 
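
The hand computation in this example can be retraced in a few lines of R. This is a sketch that types the data in directly, as read from the example (rather than loading waterwheel.rda), and uses midranks for ties through the default behavior of rank.

```r
# Water wheel data as listed in Example 10.4.1
grp1 <- c(2.3, 0.3, 5.2, 3.1, 1.1, 0.9, 2.0, 0.7, 1.4, 0.3)    # placebo (X)
grp2 <- c(0.8, 2.8, 4.0, 2.4, 1.2, 0.0, 6.2, 1.5, 28.8, 0.7)   # drug (Y)

n1 <- length(grp1); n2 <- length(grp2); n <- n1 + n2
r <- rank(c(grp1, grp2))             # ranks in the combined sample; ties get average ranks
w <- sum(r[(n1 + 1):n])              # sum of the ranks of the Y's: 116.5
mu0  <- n2 * (n + 1) / 2             # null mean (10.4.9): 105
var0 <- n1 * n2 * (n + 1) / 12       # null variance (10.4.10): 175
z <- (w - mu0) / sqrt(var0)          # asymptotic test statistic
pval <- 2 * (1 - pnorm(abs(z)))      # two-sided p-value
round(c(W = w, mean = mu0, var = var0, z = z, p = pval), 4)
```

Note that the variance used here is the untied formula (10.4.10), as in the example; no tie or continuity correction is applied.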


We next want to derive some properties of the test statistic and then use these 
properties to discuss point estimation and confidence intervals for $\Delta$. As in the 
last section, another way of writing $W$ proves helpful in these regards. Without 
loss of generality, assume that the $Y_j$s are in order. Recall that the distributions 
of $X_i$ and $Y_j$ are continuous; hence, we treat the observations as distinct. Thus 
$R(Y_j) = \#_i\{X_i < Y_j\} + \#_i\{Y_i \le Y_j\}$. This leads to 

$$\begin{aligned} W = \sum_{j=1}^{n_2} R(Y_j) &= \sum_{j=1}^{n_2} \#_i\{X_i < Y_j\} + \sum_{j=1}^{n_2} \#_i\{Y_i \le Y_j\} \\ &= \#_{i,j}\{Y_j > X_i\} + \frac{n_2(n_2+1)}{2}. \end{aligned} \tag{10.4.12}$$

Let $U = \#_{i,j}\{Y_j > X_i\}$; then we have $W = U + n_2(n_2+1)/2$. Hence an equivalent 
test for the hypotheses (10.4.4) is to reject $H_0$ if $U \ge c_2$. It follows immediately 
from Theorem 10.4.1 that, under $H_0$, $U$ is distribution free with mean $n_1 n_2/2$ 
and variance (10.4.10) and that it has an asymptotic normal distribution. The 
symmetry of the null distribution of either $U$ or $W$ can now be easily obtained. 
Under $H_0$, both $X_i$ and $Y_j$ have the same distribution, so the distributions of $U$ 
and $U' = \#_{i,j}\{X_i > Y_j\}$ must be the same. Furthermore, $U + U' = n_1 n_2$. This 
leads to 


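$$\begin{aligned} P_{H_0}\left(U - \frac{n_1 n_2}{2} = u\right) &= P_{H_0}\left(n_1 n_2 - U' - \frac{n_1 n_2}{2} = u\right) \\ &= P_{H_0}\left(U' - \frac{n_1 n_2}{2} = -u\right) \\ &= P_{H_0}\left(U - \frac{n_1 n_2}{2} = -u\right), \end{aligned}$$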

which yields the desired symmetry result in Theorem 10.4.1. 


Example 10.4.2 (Water Wheel, Continued). For the R commands to compute the 
Wilcoxon analysis, suppose y and x contain the respective samples on $Y$ and $X$. 
The R call wilcox.test(y,x) computes the Wilcoxon test. The form used is the 
statistic $U = \#_{i,j}\{Y_j > X_i\}$. For the data in Example 10.4.1, let the R vectors 
grp1 and grp2 contain the samples for group 1 and group 2, respectively. Then the 
call and the results are: 
wilcox.test(grp2,grp1); W = 61.5, p-value = 0.4053 

Note that R uses the label W for $U$. As a check, $61.5 + 10(11)/2 = 116.5 = \sum R(y_j)$, 
which agrees with the computation in Example 10.4.1. The R p-value is exact in 
the case that there are no ties and if $n_i < 50$, $i = 1, 2$. Otherwise it is based on the 
asymptotic distribution. Notice that the asymptotic p-value of Example 10.4.1 differs 
a little from the value computed by R, which applies a continuity correction and 
adjusts the variance for ties. The R function pwilcox(u,n1,n2) computes the exact cdf of 
$U$. ■ 
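
The relation between the R statistic and the rank sum can be confirmed by counting pairs directly. In this sketch the data are typed in from Example 10.4.1; counting each tied pair as one half is an assumption made here so that the direct count matches the midrank convention.

```r
grp1 <- c(2.3, 0.3, 5.2, 3.1, 1.1, 0.9, 2.0, 0.7, 1.4, 0.3)    # placebo (X)
grp2 <- c(0.8, 2.8, 4.0, 2.4, 1.2, 0.0, 6.2, 1.5, 28.8, 0.7)   # drug (Y)
n2 <- length(grp2)

# wilcox.test warns that exact p-values are unavailable with ties
res <- suppressWarnings(wilcox.test(grp2, grp1))
u <- unname(res$statistic)               # R labels this W; it is U in the text
p <- res$p.value

# Direct count of pairs with Y_j > X_i, tied pairs counted as 1/2
u_direct <- sum(outer(grp2, grp1, ">")) + 0.5 * sum(outer(grp2, grp1, "=="))

c(U = u, U_direct = u_direct, W = u + n2 * (n2 + 1) / 2, p_value = round(p, 4))
```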


Note that if $G(x) = F(x - \Delta)$, then $Y_j - \Delta$ has the same distribution as $X_i$. So 
the process of interest here is 

$$U(\Delta) = \#_{i,j}\{(Y_j - \Delta) > X_i\} = \#_{i,j}\{Y_j - X_i > \Delta\}. \tag{10.4.13}$$

Hence $U(\Delta)$ is counting the number of differences $Y_j - X_i$ that exceed $\Delta$. Let 
$D_1 < D_2 < \cdots < D_{n_1 n_2}$ denote the $n_1 n_2$ ordered differences of $Y_j - X_i$. Then 
the graph of $U(\Delta)$ is the same as that in Figure 10.2.2, except the $D_i$s are on the 
horizontal axis and the $n$ on the vertical axis is replaced by $n_1 n_2$; that is, $U(\Delta)$ is a 
decreasing step function of $\Delta$ that steps down one unit at each difference $D_i$, with 
the maximum value of $n_1 n_2$. 

We can then proceed as in the last two sections to obtain properties of inference 
based on the Wilcoxon. Let the integer $c_\alpha$ denote the critical value of a level $\alpha$ 
test of the hypotheses (10.4.4) based on the statistic $U$; i.e., $\alpha = P_{H_0}(U \ge c_\alpha)$. 
Let $\gamma_U(\Delta) = P_\Delta(U \ge c_\alpha)$, for $\Delta \ge 0$, denote the power function of the test. 
The translation property, Lemma 10.2.1, holds for the process $U(\Delta)$. Hence, as in 
Theorem 10.2.1, the power function is a nondecreasing function of $\Delta$. In particular, 
the Wilcoxon test is an unbiased test for the one-sided hypotheses (10.4.4). 


10.4.1 Asymptotic Relative Efficiency 


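The asymptotic relative efficiency (ARE) of the Wilcoxon follows along similar lines 
as for the sign test statistic in Section 10.2.1. Here, consider the sequence of local 
alternatives given by 

$$H_0: \Delta = 0 \text{ versus } H_{1n}: \Delta_n = \frac{\delta}{\sqrt{n}}, \tag{10.4.14}$$

where $\delta > 0$. We also assume that 

$$\frac{n_1}{n} \to \lambda_1, \quad \frac{n_2}{n} \to \lambda_2, \quad \text{where } \lambda_1 + \lambda_2 = 1. \tag{10.4.15}$$

This assumption implies that $n_1/n_2 \to \lambda_1/\lambda_2$; i.e., the sample sizes maintain the 
same ratio asymptotically. 
To determine the efficacy of the MWW, consider the average 

$$\bar{U}(\Delta) = \frac{1}{n_1 n_2} U(\Delta). \tag{10.4.16}$$

It follows immediately that 

$$\mu_{\bar{U}}(0) = E_0(\bar{U}(0)) = \frac{1}{2} \quad \text{and} \quad \sigma^2_{\bar{U}}(0) = \frac{n+1}{12\,n_1 n_2}. \tag{10.4.17}$$

Because the pairs $(X_i, Y_j)$ are iid we have 

$$\mu_{\bar{U}}(\Delta_n) = E_{\Delta_n}(\bar{U}(0)) = E_0(\bar{U}(-\Delta_n)) = P_0(Y - X > -\Delta_n). \tag{10.4.18}$$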


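The independence of $X$ and $Y$ and the fact $\int_{-\infty}^{\infty} F(x) f(x)\,dx = 1/2$ gives 

$$\begin{aligned} P_0(Y - X > -\Delta_n) &= E_0\left(P_0[Y > X - \Delta_n \mid X]\right) \\ &= E_0\left(1 - F(X - \Delta_n)\right) \\ &= 1 - \int_{-\infty}^{\infty} F(x - \Delta_n) f(x)\,dx \\ &= \frac{1}{2} + \int_{-\infty}^{\infty} \left(F(x) - F(x - \Delta_n)\right) f(x)\,dx \\ &\approx \frac{1}{2} + \Delta_n \int_{-\infty}^{\infty} f^2(x)\,dx, \end{aligned} \tag{10.4.19}$$

where we have applied the mean value theorem to obtain the last line. Putting 
together (10.4.17) and (10.4.19), we have the efficacy 

$$c_U = \lim_{n\to\infty} \frac{\mu'_{\bar{U}}(0)}{\sqrt{n}\,\sigma_{\bar{U}}(0)} = \sqrt{12}\sqrt{\lambda_1\lambda_2}\int_{-\infty}^{\infty} f^2(x)\,dx. \tag{10.4.20}$$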
This derivation can be made rigorous, leading to the following theorem: 


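Theorem 10.4.2 (Asymptotic Power Lemma). Consider the sequence of hypotheses 
(10.4.14). The limit of the power function of the size $\alpha$ Mann-Whitney-Wilcoxon 
test is given by 

$$\lim_{n\to\infty} \gamma_U(\Delta_n) = 1 - \Phi\left(z_\alpha - \sqrt{\lambda_1\lambda_2}\,\delta\,\tau_W^{-1}\right), \tag{10.4.21}$$

where $\tau_W = 1/[\sqrt{12}\int_{-\infty}^{\infty} f^2(x)\,dx]$, so that the efficacy is $c_U = \sqrt{\lambda_1\lambda_2}\,\tau_W^{-1}$, and $\Phi(z)$ is the 
cdf of a standard normal random variable. 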


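As in the last two sections, we can use this theorem to establish a relative measure 
of efficiency by considering sample size determination. Consider the hypotheses 
(10.4.4). Suppose we want to determine the sample size $n = n_1 + n_2$ for a level $\alpha$ 
MWW test to detect the alternative $\Delta^*$ with approximate power $\gamma^*$. By Theorem 
10.4.2, we have the equation 

$$\gamma^* = \gamma_U(\sqrt{n}\,\Delta^*/\sqrt{n}) \approx 1 - \Phi\left(z_\alpha - \sqrt{\lambda_1\lambda_2}\,\sqrt{n}\,\Delta^*\tau_W^{-1}\right). \tag{10.4.22}$$

This leads to the equation 

$$z_{\gamma^*} = z_\alpha - \sqrt{\lambda_1\lambda_2}\,\sqrt{n}\,\Delta^*\tau_W^{-1}, \tag{10.4.23}$$

where $\Phi(z_{\gamma^*}) = 1 - \gamma^*$. Solving for $n$, we obtain 

$$n_W = \left(\frac{(z_\alpha - z_{\gamma^*})\,\tau_W}{\Delta^*\sqrt{\lambda_1\lambda_2}}\right)^2. \tag{10.4.24}$$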

To use this in applications, the sample size proportions $\lambda_1 = n_1/n$ and $\lambda_2 = n_2/n$ 
must be given. As Exercise 10.4.1 points out, the most powerful two-sample designs 
have sample size proportions of 1/2, i.e., equal sample sizes. 
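
The sample size formula (10.4.24) is easy to evaluate once a value of $\tau_W$ is supplied. The sketch below uses a hypothetical function name, mww_sample_size, and takes $\tau_W = \sqrt{\pi/3}$ from Example 10.3.2, i.e., it assumes N(0,1) errors purely for illustration.

```r
# Total sample size n = n1 + n2 from (10.4.24) for a one-sided level-alpha MWW test
mww_sample_size <- function(delta, power, alpha = 0.05, tau_w, lambda1 = 0.5) {
  lambda2 <- 1 - lambda1
  z_alpha <- qnorm(1 - alpha)       # upper alpha critical value
  z_gamma <- qnorm(1 - power)       # Phi(z_gamma) = 1 - gamma*
  ((z_alpha - z_gamma) * tau_w / (delta * sqrt(lambda1 * lambda2)))^2
}

tau_w_normal <- sqrt(pi / 3)        # tau_W at the N(0,1) pdf (Example 10.3.2)
n_total <- mww_sample_size(delta = 1, power = 0.80, tau_w = tau_w_normal)
ceiling(n_total)                    # round up to a whole number of observations

# An unbalanced design needs more observations for the same power
mww_sample_size(delta = 1, power = 0.80, tau_w = tau_w_normal, lambda1 = 0.3)
```

Since the formula rests on the asymptotic power lemma, the result is a starting point that can be checked by simulation.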

To use this to obtain the asymptotic relative efficiency between the MWW and 
the two-sample pooled t-test, Exercise 10.4.2 shows that the sample size needed for 
the two-sample t-test to attain approximate power $\gamma^*$ to detect $\Delta^*$ is given by 

$$n_{LS} = \left(\frac{(z_\alpha - z_{\gamma^*})\,\sigma}{\Delta^*\sqrt{\lambda_1\lambda_2}}\right)^2, \tag{10.4.25}$$

where $\sigma^2$ is the variance of $\varepsilon_i$. Hence, as in the last section, the asymptotic relative 
efficiency between the Wilcoxon test (MWW) and the t-test is the ratio of the 
sample sizes (10.4.25) and (10.4.24), which is 

$$ARE(MWW, LS) = \frac{\sigma^2}{\tau_W^2}. \tag{10.4.26}$$

Note that this is the same ARE as derived in the last section between the signed-rank 
Wilcoxon and the t-test. If $f(x)$ is a normal pdf, then the MWW has efficiency 
95.5% relative to the pooled t-test. Thus the MWW test loses little efficiency at 
the normal. On the other hand, it is much more efficient than the pooled t-test at 
the family of contaminated normals (with $\epsilon > 0$), as in Example 10.3.3. 


10.4.2 Estimating Equations Based on the Mann-Whitney-Wilcoxon 


As with the signed-rank Wilcoxon procedure in the last section, we invert the test 
statistic to obtain an estimate of $\Delta$. As discussed in the next section, this estimate 
can be defined in terms of minimizing a norm. The estimator $\hat{\Delta}_U$ solves the 
estimating equations 

$$U(\hat{\Delta}_U) = E_{H_0}(U) = \frac{n_1 n_2}{2}. \tag{10.4.27}$$

Recalling the description of the process $U(\Delta)$ described above, it is clear that the 
Hodges-Lehmann estimator is given by 

$$\hat{\Delta}_U = \mathrm{med}_{i,j}\{Y_j - X_i\}. \tag{10.4.28}$$


The asymptotic distribution of the estimate follows in the same way as in the last 
section based on the process U(A) and the asymptotic power lemma, Theorem 
10.4.2. We avoid sketching the proof and simply state the result as a theorem: 


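Theorem 10.4.3. Assume that the random variables $X_1, X_2, \ldots, X_{n_1}$ are iid with 
pdf $f(x)$ and that the random variables $Y_1, Y_2, \ldots, Y_{n_2}$ are iid with pdf $f(x - \Delta)$. 
Then 

$$\hat{\Delta}_U \text{ has an approximate } N\left(\Delta,\ \tau_W^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\right) \text{ distribution,} \tag{10.4.29}$$

where $\tau_W = \left(\sqrt{12}\int_{-\infty}^{\infty} f^2(x)\,dx\right)^{-1}$. 
As Exercise 10.4.6 shows, provided $\mathrm{Var}(\varepsilon_i) = \sigma^2 < \infty$, the LS estimate 
$\bar{Y} - \bar{X}$ of $\Delta$ has the following approximate distribution: 

$$\bar{Y} - \bar{X} \text{ has an approximate } N\left(\Delta,\ \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\right) \text{ distribution.} \tag{10.4.30}$$

Note that the ratio of the asymptotic variances of $\bar{Y} - \bar{X}$ and $\hat{\Delta}_U$ is given by the ratio (10.4.26). 
Hence the ARE of the tests agrees with the ARE of the corresponding estimates. 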


10.4.3 Confidence Interval for the Shift Parameter $\Delta$ 


The confidence interval for $\Delta$ corresponding to the MWW estimate is derived in the 
same way as that for the Hodges-Lehmann estimate in the last section. For a given level 
$\alpha$, let the integer $c$ denote the critical point of the MWW distribution such that 
$P_\Delta[U(\Delta) \le c] = \alpha/2$. As in Section 10.2.3, we then have 

$$\begin{aligned} 1 - \alpha &= P_\Delta[c < U(\Delta) < n_1 n_2 - c] \\ &= P_\Delta[D_{c+1} \le \Delta < D_{n_1 n_2 - c}], \end{aligned} \tag{10.4.31}$$

where $D_1 < D_2 < \cdots < D_{n_1 n_2}$ denote the ordered differences $Y_j - X_i$. Therefore, 
the interval $[D_{c+1}, D_{n_1 n_2 - c})$ is a $(1-\alpha)100\%$ confidence interval for $\Delta$. Using the 
null asymptotic distribution of the MWW test statistic $U$, we have the following 
approximation for $c$: 

$$c \approx \frac{n_1 n_2}{2} - z_{\alpha/2}\sqrt{\frac{n_1 n_2 (n+1)}{12}} - \frac{1}{2}, \tag{10.4.32}$$

where $\Phi(-z_{\alpha/2}) = \alpha/2$; see Exercise 10.4.7. In practice, we use the closest integer 
to $c$. 


Example 10.4.3 (Example 10.4.1, Continued). Returning to Example 10.4.1, the 
computation in R (groups are in the vectors grp1 and grp2) yields: 
wilcox.test(grp2,grp1,conf.int=T) 
95 percent confidence interval: -0.8000273 2.8999445 
sample estimate: 0.5000127 
Hence, the Hodges-Lehmann estimate of the shift in locations is 0.50 and the confidence 
interval for the shift is $(-0.800, 2.900)$. Hence, in agreement with the test 
statistic, the confidence interval covers the null hypothesis value $\Delta = 0$. ■ 
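
The estimate (10.4.28) and the interval (10.4.31) can also be built by hand from the ordered differences, for comparison with the R output above. In this sketch the data are typed in from Example 10.4.1, and the cutoff is taken from the exact null distribution through qwilcox instead of the approximation (10.4.32); that choice, and the treatment of the tied data as if they were continuous, are assumptions of the illustration.

```r
grp1 <- c(2.3, 0.3, 5.2, 3.1, 1.1, 0.9, 2.0, 0.7, 1.4, 0.3)    # placebo (X)
grp2 <- c(0.8, 2.8, 4.0, 2.4, 1.2, 0.0, 6.2, 1.5, 28.8, 0.7)   # drug (Y)
n1 <- length(grp1); n2 <- length(grp2)

# Ordered differences D_1 <= ... <= D_{n1 n2} of Y_j - X_i
d <- sort(as.vector(outer(grp2, grp1, "-")))
hl <- median(d)                            # Hodges-Lehmann estimate of the shift

# Largest integer c with P(U <= c) < alpha/2 under the exact null distribution
alpha <- 0.05
cc <- qwilcox(alpha / 2, n2, n1) - 1
ci <- c(lower = d[cc + 1], upper = d[n1 * n2 - cc])

hl
ci
```

The small discrepancies from the wilcox.test output arise because, with ties present, R obtains its estimate and interval by a numerical root search on the normal approximation.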


10.4.4 Monte Carlo Investigation of Power 


In Section 10.3.4, we discussed a Monte Carlo investigation of the finite sample size 
relative efficiency between two location estimators. In this section, we consider finite 
sample studies of the power of two tests. As in Section 10.3.4, a Monte Carlo study 
comparing the power of two tests would be over specified families of distributions 
and sample sizes, each combination of which is a situation of the study. For our 
brief presentation, we consider one such situation. 

The model is the two-sample location model described by (10.4.2)-(10.4.3) where 
$\Delta$ is the shift in location between the models. We consider the two-sided hypotheses 

$$H_0: \Delta = 0 \text{ versus } H_1: \Delta \ne 0. \tag{10.4.33}$$

Our study compares the power of the MWW and two-sample t-test, as defined in 
Example 8.3.1, for these hypotheses. For our specific situation we consider equal 
sample sizes $n_1 = n_2 = 30$ and the contaminated normal distribution with contamination 
rate $\epsilon = 0.20$ and standard deviation ratio $\sigma_c = 10$. As the level of 
significance, we select $\alpha = 0.05$. Notice that for a given data set, a level $\alpha$ test 
rejects $H_0$ if its p-value is less than or equal to $\alpha$. 

We chose 10,000 simulations. The gist of the algorithm is straightforward. For 
each simulation, generate the independent samples; compute each test statistic; and 
record whether or not each test rejected. For each test, its empirical power is its 
number of rejections divided by the number of simulations. The following R function 
wil2powsim.R incorporates this algorithm. The first line of code contains the 
settings that were used. 

n1=30; n2=30; nsims=10000; eps=.20; vc=10; Delta=seq(-3,3,1)   # Settings 
wil2powsim <- function(n1,n2,nsims,eps,vc,Delta=0,alpha=.05){ 
   # Delta is a single shift value; returns the empirical powers (MWW, t) 
   indwil <- 0; indt <- 0 
   for(i in 1:nsims){ 
      x <- rcn(n1,eps,vc); y <- rcn(n2,eps,vc) + Delta 
      if(wilcox.test(y,x)$p.value <= alpha){indwil <- indwil + 1} 
      if(t.test(y,x,var.equal=TRUE)$p.value <= alpha){indt <- indt + 1} 
   } 
   powwil <- indwil/nsims; powt <- indt/nsims 
   return(c(powwil,powt))} 
# One run per alternative; columns correspond to the values in Delta 
sapply(Delta, function(d) wil2powsim(n1,n2,nsims,eps,vc,Delta=d)) 
Notice that power is computed at the sequence of alternatives $\Delta = -3, -2, \ldots, 3$. 
For our run, here are the empirical powers: 

[Table of empirical powers of the MWW and t-tests at each $\Delta$; entries not recovered.] 

Clearly for this situation the MWW test is much more powerful than the t-test. This 
is not surprising since the contaminated normal distribution has heavy tails and the 
t-test is impaired by the high percentage of outliers. Further, this agrees with the 
ARE between the MWW and t-tests for contaminated normal distributions. The 
empirical powers for $\Delta = 0$ are the empirical levels that are close to the nominal 
$\alpha = 0.05$. For both tests, the powers increase as $\Delta$ moves in either direction from 
0, as they should. 
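
A scaled-down version of this power study can be run without the helper file. In the sketch below, rcn is a stand-in whose body is an assumption about how contaminated normal samples are generated, pow_sim is a hypothetical name, and only 500 simulations at two alternatives are used to keep the run short.

```r
# Assumed stand-in for rcn: contaminated normal with rate eps and sd ratio sigmac
rcn <- function(n, eps, sigmac) {
  contaminated <- rbinom(n, 1, eps)
  z <- rnorm(n)
  z * (1 - contaminated) + sigmac * z * contaminated
}

# Empirical powers of the two-sided MWW and pooled t-tests at shift delta
pow_sim <- function(n1, n2, nsims, eps, sigmac, delta, alpha = 0.05) {
  rejections <- replicate(nsims, {
    x <- rcn(n1, eps, sigmac)
    y <- rcn(n2, eps, sigmac) + delta
    c(wilcox.test(y, x)$p.value <= alpha,
      t.test(y, x, var.equal = TRUE)$p.value <= alpha)
  })
  rowMeans(rejections)            # proportion of rejections for each test
}

set.seed(10)
deltas <- c(0, 2)
pow <- sapply(deltas, function(d) pow_sim(30, 30, 500, 0.20, 10, d))
dimnames(pow) <- list(c("MWW", "t"), paste0("Delta=", deltas))
pow
```

The column for a shift of 0 estimates the level of each test, so it should fall near 0.05 up to simulation error.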


EXERCISES 


10.4.1. By considering the asymptotic power lemma, Theorem 10.4.2, show that 
the equal sample size situation $n_1 = n_2$ is the most powerful design among designs 
with $n_1 + n_2 = n$, $n$ fixed, when level and alternatives are also fixed. 

Hint: Show that this problem is equivalent to maximizing the function 

$$g(n_1) = \frac{n_1(n - n_1)}{n^2},$$

and then obtain the result. 


10.4.2. Consider the asymptotic version of the t-test for the hypotheses (10.4.4) 
which is discussed in Example 4.6.2. 


(a) Using the setup of Theorem 10.4.2, derive the corresponding asymptotic power 
lemma for this test. 


(b) Use your result in part (a) to obtain expression (10.4.25). 


10.4.3. In the power study presented in Section 10.4.4, the empirical powers at 
$\Delta = 0$ are empirical levels. Find 95% confidence intervals for the true levels based 
on the empirical levels. Do they contain the nominal level $\alpha = 0.05$? 


10.4.4. In the power study of Section 10.4.4, determine (by simulation) the necessary 
common sample size so that the Wilcoxon MWW test has approximately 80% 
power to detect $\Delta = 1$. 


10.5. *General Rank Scores 607 


10.4.5. For the power study of Section 10.4.4, modify the R function wil2powsim.R 
to obtain the empirical powers for the N (0,1) distribution. 


10.4.6. Use the Central Limit Theorem to show that expression (10.4.30) is true. 


10.4.7. For the cutoff index c of the confidence interval (10.4.31) for A, derive the 
approximation given in expression (10.4.32). 


10.4.8. Let X be a continuous random variable with cdf F(x). Suppose Y = X+A, 
where A > 0. Show that Y is stochastically larger than X. 


10.4.9. Consider the data given in Example 10.4.1. 
(a) Obtain comparison boxplots of the data. 


(b) Show that the difference in sample means is 3.11, which is much larger than 
the MWW estimate of shift. What accounts for this discrepancy? 


(c) Show that the 95% confidence interval for $\Delta$ using $t$ is given by $(-2.7, 8.92)$. 
Why is this interval so much larger than the corresponding MWW interval? 


(d) Show that the value of the $t$-test statistic, discussed in Example 4.6.2, for 
this data set is 1.12 with $p$-value 0.28. Although, as with the MWW results, 
this $p$-value would be considered insignificant, it seems lower than warranted 
[consider, for example, the comparison boxplots of part (a)]. Why? 
Suppose we are interested in estimating the center of a symmetric distribution 
using an estimator that corresponds to a distribution-free procedure. By the last 
two sections our choice is either the sign test or the signed-rank Wilcoxon test. If 
the sample is drawn from a normal distribution, then of the two we would choose 
the signed-rank Wilcoxon because it is much more efficient than the sign test at 
the normal distribution. But the Wilcoxon is not fully efficient. This raises the 
question: Is there a distribution-free procedure that is fully efficient at the normal 
distribution, i.e., has efficiency of 100% relative to the $t$-test at the normal? More 
generally, suppose we specify a distribution. Is there a distribution-free procedure 
that has 100% efficiency relative to the mle at that distribution? In general, the 
answer to both of these questions is yes. In this section, we explore these questions 
for the two-sample location problem since this problem generalizes immediately to 
the regression problem of Section 10.7. A similar theory can be developed for the 
one-sample problem; see Chapter 1 of Hettmansperger and McKean (2011). 

As in the last section, let $X_1, X_2, \ldots, X_{n_1}$ be a random sample from the continuous 
distribution with cdf and pdf $F(x)$ and $f(x)$, respectively. Let $Y_1, Y_2, \ldots, Y_{n_2}$ 
be a random sample from the continuous distribution with cdf and pdf, respectively, 
$F(x - \Delta)$ and $f(x - \Delta)$, where $\Delta$ is the shift in location. Let $n = n_1 + n_2$ denote 
the combined sample size. Consider the hypotheses 

$$H_0 : \Delta = 0 \ \text{ versus } \ H_1 : \Delta > 0. \tag{10.5.1}$$
We first define a general class of rank scores. Let $\varphi(u)$ be a nondecreasing 
function defined on the interval $(0,1)$ such that $\int_0^1 \varphi^2(u)\,du < \infty$. We call $\varphi(u)$ 
a score function. Without loss of generality, we standardize this function so that 
$\int_0^1 \varphi(u)\,du = 0$ and $\int_0^1 \varphi^2(u)\,du = 1$; see Exercise 10.5.1. Next, define the scores 
$a_\varphi(i) = \varphi[i/(n+1)]$, for $i = 1,\ldots,n$. Then $a_\varphi(1) \le a_\varphi(2) \le \cdots \le a_\varphi(n)$. 
Assume that $\sum_{i=1}^{n} a_\varphi(i) = 0$ (this essentially follows from $\int_0^1 \varphi(u)\,du = 0$; see Exercise 
10.5.12). Consider the test statistic 

$$W_\varphi = \sum_{j=1}^{n_2} a_\varphi(R(Y_j)), \tag{10.5.2}$$

where $R(Y_j)$ denotes the rank of $Y_j$ in the combined sample of $n$ observations. Since 
the scores are nondecreasing, a natural rejection rule is given by 

$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } W_\varphi \ge c. \tag{10.5.3}$$


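Note that if we use the linear score function $\varphi(u) = \sqrt{12}\,(u - (1/2))$, then 

$$W_\varphi = \sum_{j=1}^{n_2} \sqrt{12}\left[\frac{R(Y_j)}{n+1} - \frac{1}{2}\right] = \frac{\sqrt{12}}{n+1}\sum_{j=1}^{n_2}\left[R(Y_j) - \frac{n+1}{2}\right] = \frac{\sqrt{12}}{n+1}\,W - \frac{\sqrt{12}\,n_2}{2}, \tag{10.5.4}$$

where $W$ is the MWW test statistic, (10.4.5). Hence the special case of a linear 
score function results in the MWW test statistic. 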
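
As a quick numerical check of (10.5.4), the following R sketch computes $W_\varphi$ with the linear score function and compares it with the right-hand side built from the sum of the ranks of the $Y_j$. The simulated samples and the names `wil_score`, `W_phi`, and `W` are illustrative assumptions, and `W` is assumed to be the rank-sum form of the MWW statistic.

```r
set.seed(1)
x <- rnorm(6)                      # first sample (n1 = 6), simulated for illustration
y <- rnorm(5) + 0.5                # second sample (n2 = 5)
n2 <- length(y)
n  <- length(x) + n2

wil_score <- function(u) sqrt(12) * (u - 0.5)   # linear (Wilcoxon) score function

r_y   <- rank(c(x, y))[-seq_along(x)]           # ranks of the Y's in the combined sample
W_phi <- sum(wil_score(r_y / (n + 1)))          # statistic (10.5.2) with linear scores
W     <- sum(r_y)                               # rank-sum form of the MWW statistic (assumed)

rhs <- sqrt(12) / (n + 1) * W - sqrt(12) * n2 / 2   # right-hand side of (10.5.4)
all.equal(W_phi, rhs)
```

The two quantities agree up to floating-point error for any continuous samples, because the identity is purely algebraic in the ranks.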


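To complete the decision rule (10.5.3), we need the null distribution of the test 
statistic $W_\varphi$. Many of its properties follow along the same lines as those of the 
MWW test. First, $W_\varphi$ is distribution free because, under the null hypothesis, every 
subset of ranks for the $Y_j$s is equilikely. In general, the distribution of $W_\varphi$ cannot 
be obtained in closed form, but it can be generated recursively similarly to the 
distribution of the MWW test statistic. Next, to obtain the null mean of $W_\varphi$, use 
the fact that $R(Y_j)$ is uniform on the integers $1,2,\ldots,n$. Because $\sum_{i=1}^{n} a_\varphi(i) = 0$, 
we then have 

$$E_{H_0}(W_\varphi) = \sum_{j=1}^{n_2} E_{H_0}\big(a_\varphi(R(Y_j))\big) = \sum_{j=1}^{n_2}\sum_{i=1}^{n} a_\varphi(i)\,\frac{1}{n} = 0. \tag{10.5.5}$$

To determine the null variance, define the quantity $s_a^2$ by the equation 

$$E_{H_0}\big(a_\varphi^2(R(Y_j))\big) = \sum_{i=1}^{n} a_\varphi^2(i)\,\frac{1}{n} = \frac{1}{n}\sum_{i=1}^{n} a_\varphi^2(i) = \frac{1}{n}\,s_a^2. \tag{10.5.6}$$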
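As Exercise 10.5.4 shows, $s_a^2/n \approx 1$. Since $E_{H_0}(W_\varphi) = 0$, we have 

$$\begin{aligned}
\text{Var}_{H_0}(W_\varphi) = E_{H_0}(W_\varphi^2) &= \sum_{j=1}^{n_2}\sum_{j'=1}^{n_2} E_{H_0}\big[a_\varphi(R(Y_j))\,a_\varphi(R(Y_{j'}))\big] \\
&= \sum_{j=1}^{n_2} E_{H_0}\big[a_\varphi^2(R(Y_j))\big] + \sum\sum_{j \ne j'} E_{H_0}\big[a_\varphi(R(Y_j))\,a_\varphi(R(Y_{j'}))\big] \\
&= \frac{n_2}{n}\,s_a^2 - \frac{n_2(n_2-1)}{n(n-1)}\,s_a^2 \qquad (10.5.7) \\
&= \frac{n_1 n_2}{n(n-1)}\,s_a^2; \qquad (10.5.8)
\end{aligned}$$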


see Exercise 10.5.2 for the derivation of the second term in expression (10.5.7). In 
more advanced books, it is shown that $W_\varphi$ is asymptotically normal under $H_0$. 
Hence the corresponding asymptotic decision rule of level $\alpha$ is 

$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } z = \frac{W_\varphi}{\sqrt{\text{Var}_{H_0}(W_\varphi)}} \ge z_\alpha. \tag{10.5.9}$$

To answer the questions posed in the first paragraph of this section, the efficacy 
of the test statistic $W_\varphi$ is needed. To proceed along the lines of the last section, 
define the process 

$$W_\varphi(\Delta) = \sum_{j=1}^{n_2} a_\varphi(R(Y_j - \Delta)), \tag{10.5.10}$$


where $R(Y_j - \Delta)$ denotes the rank of $Y_j - \Delta$ among $X_1,\ldots,X_{n_1}, Y_1 - \Delta,\ldots,Y_{n_2} - \Delta$. 
In the last section, the process for the MWW statistic was also written in terms of 
counts of the differences $Y_j - X_i$. We are not as fortunate here, but as the next 
theorem shows, this general process is a simple decreasing step function of $\Delta$. 
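
The test statistic (10.5.2), its null variance (10.5.8), and the asymptotic rule (10.5.9) can be sketched in a few lines of R. The function name `rank_score_test` and its arguments are hypothetical, the samples are simulated, and the sketch assumes there are no ties; the p-value is the one-sided value for $H_1: \Delta > 0$.

```r
rank_score_test <- function(x, y, score) {
  n1 <- length(x); n2 <- length(y); n <- n1 + n2
  a  <- score(seq_len(n) / (n + 1))             # scores a(1), ..., a(n)
  r  <- rank(c(x, y))[-seq_len(n1)]             # ranks of the Y's (no ties assumed)
  W  <- sum(a[r])                               # test statistic (10.5.2)
  v  <- n1 * n2 / (n * (n - 1)) * sum(a^2)      # null variance (10.5.8)
  z  <- W / sqrt(v)                             # standardized statistic in (10.5.9)
  list(statistic = W, variance = v, z = z, p.value = 1 - pnorm(z))
}

set.seed(2)
x <- rnorm(10); y <- rnorm(12) + 1              # simulated samples, shift of 1
res <- rank_score_test(x, y, function(u) sqrt(12) * (u - 0.5))
res$variance                                    # equals n1 * n2 / (n + 1) for Wilcoxon scores
```

For the linear score function the null variance reduces to $n_1 n_2/(n+1)$, which agrees with the variance of the rank-sum statistic after the rescaling in (10.5.4).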


Theorem 10.5.1. The process $W_\varphi(\Delta)$ is a decreasing step function of $\Delta$ which 
steps down at each difference $Y_j - X_i$, $i = 1,\ldots,n_1$ and $j = 1,\ldots,n_2$. Its maximum 
and minimum values are $\sum_{j=n_1+1}^{n} a_\varphi(j) \ge 0$ and $\sum_{j=1}^{n_2} a_\varphi(j) \le 0$, respectively. 


Proof: Suppose $\Delta_1 < \Delta_2$ and $W_\varphi(\Delta_1) \ne W_\varphi(\Delta_2)$. Hence the assignment of the 
ranks among the $X_i$ and $Y_j - \Delta$ must differ at $\Delta_1$ and $\Delta_2$; that is, there must 
be a $j$ and an $i$ such that $Y_j - \Delta_2 < X_i$ and $Y_j - \Delta_1 > X_i$. This implies that 
$\Delta_1 < Y_j - X_i < \Delta_2$. Thus $W_\varphi(\Delta)$ changes values at the differences $Y_j - X_i$. To 
show it is decreasing, suppose $\Delta_1 < Y_j - X_i < \Delta_2$ and there are no other differences 
between $\Delta_1$ and $\Delta_2$. Then $Y_j - \Delta_1$ and $X_i$ must have adjacent ranks; otherwise, 
there would be more than one difference between $\Delta_1$ and $\Delta_2$. Since $Y_j - \Delta_1 > X_i$ 
and $Y_j - \Delta_2 < X_i$, we have 

$$R(Y_j - \Delta_1) = R(X_i) + 1 \ \text{ and } \ R(Y_j - \Delta_2) = R(X_i) - 1.$$

Also, in the expression for $W_\varphi(\Delta)$, only the rank of the $Y_j$ term has changed in the 
interval $[\Delta_1, \Delta_2]$. Therefore, since the scores are nondecreasing, 

$$\begin{aligned}
W_\varphi(\Delta_1) - W_\varphi(\Delta_2) &= \Big[\sum_{k \ne j} a_\varphi(R(Y_k - \Delta_1)) + a_\varphi(R(Y_j - \Delta_1))\Big] \\
&\quad - \Big[\sum_{k \ne j} a_\varphi(R(Y_k - \Delta_2)) + a_\varphi(R(Y_j - \Delta_2))\Big] \\
&= a_\varphi(R(X_i) + 1) - a_\varphi(R(X_i) - 1) \ge 0.
\end{aligned}$$


Because $W_\varphi(\Delta)$ is a decreasing step function and steps only at the differences 
$Y_j - X_i$, its maximum value occurs when $\Delta < Y_j - X_i$ for all $i, j$, i.e., when $X_i < Y_j - \Delta$ 
for all $i, j$. Hence, in this case, the variables $Y_j - \Delta$ must get all the high ranks, so 

$$\max_\Delta W_\varphi(\Delta) = \sum_{j=n_1+1}^{n} a_\varphi(j).$$

Note that this maximum value must be nonnegative. For suppose it was strictly 
negative; then at least one $a_\varphi(j) < 0$ for $j = n_1 + 1,\ldots,n$. Because the scores are 
nondecreasing, $a_\varphi(i) < 0$ for all $i = 1,\ldots,n_1$. This leads to the contradiction 

$$0 > \sum_{j=n_1+1}^{n} a_\varphi(j) \ge \sum_{j=n_1+1}^{n} a_\varphi(j) + \sum_{j=1}^{n_1} a_\varphi(j) = 0.$$

The results for the minimum value are obtained in the same way; see Exercise 10.5.6. 
∎ 


As Exercise 10.5.7 shows, the translation property, Lemma 10.2.1, holds for the 
process $W_\varphi(\Delta)$. Using this result and the last theorem, we can show that the power 
function of the test statistic $W_\varphi$ for the hypotheses (10.5.1) is nondecreasing. Hence 
the test is unbiased. 
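
The conclusions of Theorem 10.5.1 are easy to observe numerically. The R sketch below evaluates the process (10.5.10) between consecutive differences $Y_j - X_i$ and compares its extreme values with the sums of the largest and smallest scores. The simulated data, the choice of normal scores, and the names `W_process`, `grid`, and `w` are assumptions made only for illustration.

```r
set.seed(3)
x <- rnorm(5); y <- rnorm(4)                     # n1 = 5, n2 = 4, simulated
n <- length(x) + length(y)
a <- qnorm(seq_len(n) / (n + 1))                 # normal scores, used here as an example

W_process <- function(delta) {
  r <- rank(c(x, y - delta))[-seq_along(x)]      # ranks of Y_j - delta
  sum(a[r])
}

d    <- sort(as.vector(outer(y, x, "-")))        # the n1 * n2 differences Y_j - X_i
grid <- c(d[1] - 1, (d[-1] + d[-length(d)]) / 2, d[length(d)] + 1)
w    <- sapply(grid, W_process)                  # one value of the process per step

all(diff(w) <= 1e-12)                            # the process never increases
c(max = w[1], min = w[length(w)])
c(sum(a[6:9]), sum(a[1:4]))                      # sums of the n2 largest and n2 smallest scores
```

Each grid point lies strictly between two adjacent differences, so `w` lists the successive levels of the step function from left to right.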


10.5.1 Efficacy 


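We next sketch the derivation of the efficacy of the test based on $W_\varphi$. Our arguments 
can be made rigorous; see advanced texts. Consider the statistic given by the average 

$$\overline{W}_\varphi(0) = \frac{1}{n}\,W_\varphi(0). \tag{10.5.11}$$

Based on (10.5.5) and (10.5.8), we have 

$$\mu_\varphi(0) = E_0(\overline{W}_\varphi(0)) = 0 \ \text{ and } \ \sigma_\varphi^2 = \text{Var}_0(\overline{W}_\varphi(0)) = \frac{n_1 n_2}{n(n-1)}\,n^{-2} s_a^2. \tag{10.5.12}$$

Notice from Exercise 10.5.4 that, because $s_a^2/n \approx 1$, the variance of $\overline{W}_\varphi(0)$ is of the 
order $O(n^{-1})$. We have 

$$\mu_\varphi(\Delta) = E_\Delta[\overline{W}_\varphi(0)] = E_0[\overline{W}_\varphi(-\Delta)] = \frac{1}{n}\sum_{j=1}^{n_2} E_0[a_\varphi(R(Y_j + \Delta))]. \tag{10.5.13}$$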
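Suppose that $\hat{F}_{n_1}$ and $\hat{F}_{n_2}$ are the empirical cdfs of the random samples $X_1,\ldots,X_{n_1}$ 
and $Y_1,\ldots,Y_{n_2}$, respectively. The relationship between the ranks and empirical cdfs 
follows as 

$$\begin{aligned}
R(Y_j + \Delta) &= \#_k\{Y_k + \Delta \le Y_j + \Delta\} + \#_i\{X_i \le Y_j + \Delta\} \\
&= \#_k\{Y_k \le Y_j\} + \#_i\{X_i \le Y_j + \Delta\} \\
&= n_2 \hat{F}_{n_2}(Y_j) + n_1 \hat{F}_{n_1}(Y_j + \Delta). \qquad (10.5.14)
\end{aligned}$$

Substituting this last expression into expression (10.5.13), we get 

$$\begin{aligned}
\mu_\varphi(\Delta) &= \frac{1}{n}\sum_{j=1}^{n_2} E_0\left\{\varphi\left[\frac{n_2}{n+1}\hat{F}_{n_2}(Y_j) + \frac{n_1}{n+1}\hat{F}_{n_1}(Y_j + \Delta)\right]\right\} \qquad (10.5.15) \\
&\to \lambda_2 E_0\{\varphi[\lambda_2 F(Y) + \lambda_1 F(Y + \Delta)]\} \qquad (10.5.16) \\
&= \lambda_2 \int_{-\infty}^{\infty} \varphi[\lambda_2 F(y) + \lambda_1 F(y + \Delta)]\,f(y)\,dy. \qquad (10.5.17)
\end{aligned}$$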


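The limit in expression (10.5.16) is actually a double limit, which follows from 
$\hat{F}_{n_i}(x) \to F(x)$, $i = 1,2$, under $H_0$, and the observation that upon substituting $F$ for 
the empirical cdfs in expression (10.5.15), the sum contains identically distributed 
random variables and, thus, the same expectation. These approximations can be 
made rigorous. It follows immediately that 

$$\frac{d}{d\Delta}\mu_\varphi(\Delta) = \lambda_2 \int_{-\infty}^{\infty} \varphi'[\lambda_2 F(y) + \lambda_1 F(y + \Delta)]\,\lambda_1 f(y + \Delta)\,f(y)\,dy.$$

Hence 

$$\mu_\varphi'(0) = \lambda_1\lambda_2 \int_{-\infty}^{\infty} \varphi'[F(y)]\,f^2(y)\,dy. \tag{10.5.18}$$

From (10.5.12), 

$$\sqrt{n}\,\sigma_\varphi = \sqrt{n}\,\sqrt{\frac{n_1 n_2}{n(n-1)}}\,\frac{1}{\sqrt{n}}\,\frac{1}{\sqrt{n}}\,s_a \to \sqrt{\lambda_1\lambda_2}. \tag{10.5.19}$$

Based on (10.5.18) and (10.5.19), the efficacy of $W_\varphi$ is given by 

$$c_\varphi = \lim_{n\to\infty} \frac{\mu_\varphi'(0)}{\sqrt{n}\,\sigma_\varphi} = \sqrt{\lambda_1\lambda_2} \int_{-\infty}^{\infty} \varphi'[F(y)]\,f^2(y)\,dy. \tag{10.5.20}$$

Using the efficacy, the asymptotic power can be derived for the test statistic 
$W_\varphi$. Consider the sequence of local alternatives given by (10.4.14) and the level $\alpha$ 
asymptotic test based on $W_\varphi$. Denote the power function of the test by $\gamma_\varphi(\Delta_n)$. 
Then it can be shown that 

$$\lim_{n\to\infty} \gamma_\varphi(\Delta_n) = 1 - \Phi(z_\alpha - c_\varphi\delta), \tag{10.5.21}$$

where $\Phi(z)$ is the cdf of a standard normal random variable. Sample size determination 
based on the test statistic $W_\varphi$ proceeds as in the last few sections; see 
Exercise 10.5.8. 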
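
To make the efficacy (10.5.20) and the limiting power (10.5.21) concrete, the following R sketch evaluates the integral numerically when sampling from a $N(0,1)$ distribution, once for the linear (Wilcoxon) score function and once for the score function $\Phi^{-1}(u)$. The helper names `efficacy_integral` and `asy_power`, the equal sample proportions, the value $\delta = 3$, and the finite integration range $[-6, 6]$ (used to avoid numerical overflow in the far tails) are all assumptions of this illustration.

```r
# Integral in (10.5.20): int phi'[F(y)] f(y)^2 dy, evaluated numerically on [lower, upper]
efficacy_integral <- function(dphi, cdf, dens, lower = -6, upper = 6) {
  integrate(function(y) dphi(cdf(y)) * dens(y)^2, lower, upper, rel.tol = 1e-8)$value
}

dphi_wil <- function(u) rep(sqrt(12), length(u))   # derivative of sqrt(12) * (u - 1/2)
dphi_ns  <- function(u) 1 / dnorm(qnorm(u))        # derivative of qnorm(u)

lambda1 <- 0.5; lambda2 <- 0.5                     # equal sample sizes (assumed)
c_wil <- sqrt(lambda1 * lambda2) * efficacy_integral(dphi_wil, pnorm, dnorm)
c_ns  <- sqrt(lambda1 * lambda2) * efficacy_integral(dphi_ns,  pnorm, dnorm)

asy_power <- function(c_phi, delta, alpha = 0.05) {
  1 - pnorm(qnorm(1 - alpha) - c_phi * delta)      # limit in (10.5.21)
}

c(c_wil = c_wil, c_ns = c_ns, ratio_sq = (c_wil / c_ns)^2)
asy_power(c(c_wil, c_ns), delta = 3)
```

The squared ratio of the two efficacies is $3/\pi \approx 0.955$, the efficiency of the Wilcoxon procedure at the normal distribution that is recalled later in this section.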
10.5.2 Estimating Equations Based on General Scores 


Suppose we are using the scores $a_\varphi(i) = \varphi(i/(n + 1))$ discussed in Section 10.5.1. 
Recall that the mean of the test statistic $W_\varphi$ is 0. Hence the corresponding estimator 
of $\Delta$ solves the estimating equations 

$$W_\varphi(\hat{\Delta}) \approx 0. \tag{10.5.22}$$

By Theorem 10.5.1, $W_\varphi(\Delta)$ is a decreasing step function of $\Delta$. Furthermore, the 
maximum value is positive and the minimum value is negative (only degenerate 
cases would result in one or both of these as 0); hence, the solution to equation 
(10.5.22) exists. Because $W_\varphi(\Delta)$ is a step function, it may not be unique. When 
it is not unique, though, as with Wilcoxon and median procedures, there is an 
interval of solutions, so the midpoint of the interval can be chosen. This is an 
easy equation to solve numerically because simple iterative techniques such as the 
bisection method or the method of false position can be used; see the discussion on 
page 210 of Hettmansperger and McKean (2011). The asymptotic distribution of 
the estimator can be derived using the asymptotic power lemma and is given by 

$$\hat{\Delta}_\varphi \ \text{ has an approximate } \ N\left(\Delta,\ \tau_\varphi^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\right) \ \text{ distribution}, \tag{10.5.23}$$

where 

$$\tau_\varphi = \left[\int_{-\infty}^{\infty} \varphi'[F(y)]\,f^2(y)\,dy\right]^{-1}. \tag{10.5.24}$$

Hence the efficacy can be expressed as $c_\varphi = \sqrt{\lambda_1\lambda_2}\,\tau_\varphi^{-1}$. As Exercise 10.5.9 shows, 
the parameter $\tau_\varphi$ is a scale parameter, so the efficacy varies inversely with scale. 
This observation proves helpful in the next subsection. 
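
Because the process changes only at the differences $Y_j - X_i$, the estimating equation (10.5.22) can also be solved by evaluating the process once between each pair of adjacent differences. The R sketch below does this; the function names `rank_score_process` and `rank_score_estimate`, the tolerance `tol`, and the simulated data are hypothetical, and the code assumes continuous data without ties. It is a direct search rather than the bisection or false-position iterations mentioned above.

```r
rank_score_process <- function(delta, x, y, score) {
  n <- length(x) + length(y)
  r <- rank(c(x, y - delta))[-seq_along(x)]      # ranks of Y_j - delta
  sum(score(r / (n + 1)))                        # process (10.5.10)
}

rank_score_estimate <- function(x, y, score, tol = 1e-8) {
  d <- sort(as.vector(outer(y, x, "-")))         # differences Y_j - X_i, in order
  N <- length(d)
  grid <- c(d[1] - 1, (d[-1] + d[-N]) / 2, d[N] + 1)   # one point per step of the process
  w <- sapply(grid, rank_score_process, x = x, y = y, score = score)
  zero <- which(abs(w) < tol)
  if (length(zero) > 0) {
    (d[min(zero) - 1] + d[max(zero)]) / 2        # midpoint of the interval of solutions
  } else {
    d[max(which(w > 0))]                         # difference at which the process crosses 0
  }
}

set.seed(4)
x <- rnorm(8); y <- rnorm(7) + 1
wil <- function(u) sqrt(12) * (u - 0.5)
c(rank_score_estimate(x, y, wil), median(outer(y, x, "-")))
```

With the linear score function the two printed values coincide, since the solution is then the median of the differences; other nondecreasing standardized score functions can be passed in the same way.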


10.5.3 Optimization: Best Estimates 


We can now answer the questions posed in the first paragraph. For a given pdf 
$f(x)$, we show that in general we can select a score function that maximizes the 
power of the test and minimizes the asymptotic variance of the estimator. Under 
certain conditions we show that estimators based on this optimal score function 
have the same efficiency as maximum likelihood estimators (mles); i.e., they attain 
the Rao–Cramér Lower Bound. 

As above, let $X_1,\ldots,X_{n_1}$ be a random sample from the continuous cdf $F(x)$ 
with pdf $f(x)$. Let $Y_1,\ldots,Y_{n_2}$ be a random sample from the continuous cdf $F(x - \Delta)$ 
with pdf $f(x - \Delta)$. The problem is to choose $\varphi$ to maximize the efficacy $c_\varphi$ given in 
expression (10.5.20). Note that maximizing the efficacy is equivalent to minimizing 
the asymptotic variance of the corresponding estimator of $\Delta$. 

For a general score function $\varphi(u)$, consider its efficacy given by expression 
(10.5.20). Without loss of generality, the relative sample sizes in this expression 
can be ignored, so we consider $c_\varphi^* = (\sqrt{\lambda_1\lambda_2})^{-1} c_\varphi$. If we make the change of 
variables $u = F(y)$ and then integrate by parts, we get 

$$c_\varphi^* = \int_{-\infty}^{\infty} \varphi'[F(y)]\,f^2(y)\,dy = \int_0^1 \varphi'(u)\,f(F^{-1}(u))\,du = \int_0^1 \varphi(u)\left[-\frac{f'(F^{-1}(u))}{f(F^{-1}(u))}\right]du. \tag{10.5.25}$$


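Recall that the score function satisfies $\int_0^1 \varphi^2(u)\,du = 1$. Thus we can state the problem 
as 

$$\begin{aligned}
\max_\varphi c_\varphi^{*2} &= \max_\varphi \left\{\int_0^1 \varphi(u)\left[-\frac{f'(F^{-1}(u))}{f(F^{-1}(u))}\right]du\right\}^2 \\
&= \max_\varphi \left\{\frac{\left\{\int_0^1 \varphi(u)\left[-\frac{f'(F^{-1}(u))}{f(F^{-1}(u))}\right]du\right\}^2}{\int_0^1 \varphi^2(u)\,du \int_0^1 \left[\frac{f'(F^{-1}(u))}{f(F^{-1}(u))}\right]^2 du}\right\} \int_0^1 \left[\frac{f'(F^{-1}(u))}{f(F^{-1}(u))}\right]^2 du.
\end{aligned}$$

The quantity that we are maximizing in the braces of this last expression, however, 
is the square of a correlation coefficient, which achieves its maximum value 1. 
Therefore, by choosing the score function $\varphi(u) = \varphi_f(u)$, where 

$$\varphi_f(u) = -\kappa\,\frac{f'(F^{-1}(u))}{f(F^{-1}(u))}, \tag{10.5.26}$$

and $\kappa$ is a constant chosen so that $\int_0^1 \varphi_f^2(u)\,du = 1$, then the correlation coefficient 
is 1 and the maximum value is 

$$I(f) = \int_0^1 \left[\frac{f'(F^{-1}(u))}{f(F^{-1}(u))}\right]^2 du, \tag{10.5.27}$$

which is Fisher information for the location model. We call the score function given 
by (10.5.26) the optimal score function. 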

In terms of estimation, if $\hat{\Delta}_\varphi$ is the corresponding estimator, then, according to 
(10.5.24), it has the asymptotic variance 

$$\left[\frac{1}{I(f)}\right]\left(\frac{1}{n_1} + \frac{1}{n_2}\right). \tag{10.5.28}$$

Thus the estimator $\hat{\Delta}_\varphi$ achieves asymptotically the Rao–Cramér lower bound; that 
is, $\hat{\Delta}_\varphi$ is an asymptotically efficient estimator of $\Delta$. In terms of asymptotic relative 
efficiency, the ARE between the estimator $\hat{\Delta}_\varphi$ and the mle of $\Delta$ is 1. Thus we have 
answered the second question of the first paragraph of this section. 

Now we look at some examples. The initial example assumes that the distribution 
of $\varepsilon_i$ is normal, which answers the leading question at the beginning of this 
section. First, though, note an invariance that simplifies matters. Suppose $Z$ is 
a scale and location transformation of a random variable $X$; i.e., $Z = a(X - b)$, 
where $a > 0$ and $-\infty < b < \infty$. Because the efficacy varies inversely with scale, we 
have $c_{f_Z}^{*2} = a^{-2} c_{f_X}^{*2}$. Furthermore, as Exercise 10.5.9 shows, the efficacy is invariant 
to location and, also, $I(f_Z) = a^{-2} I(f_X)$. Hence the quantity maximized above is 
invariant to changes in location and scale. In particular, in the derivation of optimal 
scores, only the form of the density is important. 


Example 10.5.1 (Normal Scores). Suppose the error random variable $\varepsilon_i$ has a 
normal distribution. Based on the discussion in the last paragraph, we can take the 
pdf of a $N(0,1)$ distribution as the form of the density. So consider $f_Z(z) = \phi(z) = 
(2\pi)^{-1/2}\exp\{-2^{-1}z^2\}$. Then $-\phi'(z) = z\phi(z)$. Let $\Phi(z)$ denote the cdf of $Z$. Hence 
the optimal score function is 

$$\varphi_N(u) = -\kappa\,\frac{\phi'(\Phi^{-1}(u))}{\phi(\Phi^{-1}(u))} = \Phi^{-1}(u); \tag{10.5.29}$$

see Exercise 10.5.5, which shows that $\kappa = 1$ as well as that $\int_0^1 \varphi_N(u)\,du = 0$. The 
corresponding scores, $a_N(i) = \Phi^{-1}(i/(n+1))$, are often called the normal scores. 
Denote the process by 

$$W_N(\Delta) = \sum_{j=1}^{n_2} \Phi^{-1}[R(Y_j - \Delta)/(n + 1)]. \tag{10.5.30}$$

The associated test statistic for the hypotheses (10.5.1) is the statistic $W_N = W_N(0)$. 
The estimator of $\Delta$ solves the estimating equations 

$$W_N(\hat{\Delta}_N) \approx 0. \tag{10.5.31}$$

Although the estimate cannot be obtained in closed form, this equation is relatively 
easy to solve numerically. From the above discussion, $\text{ARE}(\hat{\Delta}_N, \overline{Y} - \overline{X}) = 1$ at the 
normal distribution. Hence normal score procedures are fully efficient at the normal 
distribution. Actually, a much more powerful result can be obtained for symmetric 
distributions. It can be shown that $\text{ARE}(\hat{\Delta}_N, \overline{Y} - \overline{X}) \ge 1$ at all symmetric 
distributions. ∎ 


Example 10.5.2 (Wilcoxon Scores). Suppose the random errors, $\varepsilon_i$, $i = 1,2,\ldots,n$, 
have a logistic distribution with pdf $f_Z(z) = \exp\{-z\}/(1 + \exp\{-z\})^2$. Then the 
corresponding cdf is $F_Z(z) = (1 + \exp\{-z\})^{-1}$. As Exercise 10.5.11 shows, 

$$-\frac{f_Z'(z)}{f_Z(z)} = F_Z(z)(1 - \exp\{-z\}) \ \text{ and } \ F_Z^{-1}(u) = \log\frac{u}{1-u}. \tag{10.5.32}$$

Upon standardization, this leads to the optimal score function, 

$$\varphi_W(u) = \sqrt{12}\,(u - (1/2)), \tag{10.5.33}$$

that is, the Wilcoxon scores. The properties of the inference based on Wilcoxon 
scores are discussed in Section 10.4. Let $\hat{\Delta}_W = \text{med}\{Y_j - X_i\}$ denote the corresponding 
estimate. Recall that $\text{ARE}(\hat{\Delta}_W, \overline{Y} - \overline{X}) = 0.955$ at the normal. Hodges 
and Lehmann (1956) showed that $\text{ARE}(\hat{\Delta}_W, \overline{Y} - \overline{X}) \ge 0.864$ over all symmetric 
distributions. ∎ 
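
The optimal score function (10.5.26) can be checked numerically for the two densities considered so far. The R sketch below approximates $-f'(F^{-1}(u))/f(F^{-1}(u))$ with a central difference and compares it with $2u - 1$ for the logistic pdf and with $\Phi^{-1}(u)$ for the normal pdf. The function name `optimal_score`, the step size `h`, and the grid of $u$ values are assumptions of this illustration; the constant $\kappa$ is omitted, so the logistic result is the unstandardized form of (10.5.33).

```r
optimal_score <- function(u, dens, quant, h = 1e-5) {
  z  <- quant(u)                                 # F^{-1}(u)
  df <- (dens(z + h) - dens(z - h)) / (2 * h)    # central-difference approximation to f'
  -df / dens(z)                                  # unstandardized score, kappa = 1
}

u <- seq(0.05, 0.95, by = 0.05)
max(abs(optimal_score(u, dlogis, qlogis) - (2 * u - 1)))   # logistic: 2u - 1
max(abs(optimal_score(u, dnorm,  qnorm)  - qnorm(u)))      # normal: qnorm(u)

# Fisher information (10.5.27) for the logistic form, which should be close to 1/3
info <- integrate(function(v) optimal_score(v, dlogis, qlogis)^2, 0, 1)$value
c(info = info, kappa = 1 / sqrt(info))
```

Since $I(f) = 1/3$ for the logistic form, $\kappa = \sqrt{3}$ and the standardized score is $\sqrt{3}(2u - 1) = \sqrt{12}(u - 1/2)$, in agreement with (10.5.33).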
Table 10.5.1: Data for Example 10.5.3 


Sample 1 (X) | | | Sample 2 (Y) | | 

Data | Ranks | Normal Scores | Data | Ranks | Normal Scores 
51.9 | 15 | -0.04044 | 59.2 | 24 | 0.75273 
56.9 | 23 | 0.64932 | 49.1 | 14 | -0.12159 
45.2 | 11 | -0.37229 | 54.4 | 19 | 0.28689 
52.3 | 16 | 0.04044 | 47.0 | 13 | -0.20354 
59.5 | 26 | 0.98917 | 55.9 | 21 | 0.46049 
41.4 | 4 | -1.13098 | 34.9 | 3 | -1.30015 
46.4 | 12 | -0.28689 | 62.2 | 28 | 1.30015 
45.1 | 10 | -0.46049 | 41.6 | 6 | -0.86489 
53.9 | 17 | 0.12159 | 59.3 | 25 | 0.86489 
42.9 | 7 | -0.75273 | 32.7 | 1 | -1.84860 
41.5 | 5 | -0.98917 | 72.1 | 29 | 1.51793 
55.2 | 20 | 0.37229 | 43.8 | 8 | -0.64932 
32.9 | 2 | -1.51793 | 56.8 | 22 | 0.55244 
54.0 | 18 | 0.20354 | 76.7 | 30 | 1.84860 
45.0 | 9 | -0.55244 | 60.3 | 27 | 1.13098 


Example 10.5.3. As a numerical illustration, we consider some generated normal 
observations. The first sample, labeled X, was generated from a $N(48, 10^2)$ 
distribution, while the second sample, Y, was generated from a $N(58, 10^2)$ distribution. 
The data are displayed in Table 10.5.1, but they can also be found in the file 
examp1053.rda. Also in Table 10.5.1, the ranks and the normal scores are exhibited. 
We consider tests of the two-sided hypotheses $H_0 : \Delta = 0$ versus $H_1 : \Delta \ne 0$ 
for the Wilcoxon, normal scores, and Student $t$ procedures. The next segment of R 
code returns the results in Table 10.5.2. As we have used the R functions t.test 
and wilcox.test in the last section, we do not show their results in the segment, 
but we do show the results for the normal scores. The code assumes that the R 
vectors x and y contain the respective samples. 

t.test(y,x); wilcox.test(y,x,conf.int=T) 

# combined sample, indicator of the second sample, and ranks 
zed=c(x,y); ind=c(rep(0,15),rep(1,15)); rz=rank(zed) 

# normal scores and the null variance (10.5.8) 
phis=qnorm(rz/31); varns=((15*15)/(30*29))*sum(phis^2) 

# test statistic, standardized statistic, and two-sided p-value 
nstst=sum(ind*phis); stdns=nstst/sqrt(varns) 

pns=2*(1-pnorm(abs(stdns))) 

nstst; stdns; pns 

3.727011; 1.483559; 0.137926 
To complete the summary in Table 10.5.2 we need the estimate of $\Delta$ based on the 
rank-based normal scores process. Kloke and McKean (2014) discuss the use of the 
CRAN package Rfit for this computation. If this package is installed in the user's 
area, then the following command computes this estimate of $\Delta$: 

rfit(zed~ind,scores=nscores)$coef[2] 

5.100012 




Table 10.5.2: Summary of analyses for Example 10.5.3 


Standardized 


Student t 


Wilcoxon 
Normal scores 


Notice that the standardized test statistics and their corresponding $p$-values are 
quite similar and all would result in the same decision regarding the hypotheses. 
As shown in the table, the corresponding point estimates of $\Delta$ are also alike. 

We changed $x_5$ to be an outlier with value 95.5 and then reran the analyses. The 
$t$-analysis was the most affected, for on the changed data, $t = 0.63$ with a $p$-value 
of 0.53. In contrast, the Wilcoxon analysis was the least affected ($z = 1.37$ and 
$p = 0.17$). The normal scores analysis was more affected by the outlier than the 
Wilcoxon analysis with $z = 1.14$ and $p = 0.25$. ∎ 
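
The normal scores computations of this example, including the rerun with the outlier, can be collected in one small function. The sketch below enters the data of Table 10.5.1 directly; the function name `ns_test` and the vector `x_out` are hypothetical names introduced for illustration.

```r
x <- c(51.9, 56.9, 45.2, 52.3, 59.5, 41.4, 46.4, 45.1, 53.9, 42.9,
       41.5, 55.2, 32.9, 54.0, 45.0)
y <- c(59.2, 49.1, 54.4, 47.0, 55.9, 34.9, 62.2, 41.6, 59.3, 32.7,
       72.1, 43.8, 56.8, 76.7, 60.3)

ns_test <- function(x, y) {
  n1 <- length(x); n2 <- length(y); n <- n1 + n2
  phis <- qnorm(rank(c(x, y)) / (n + 1))          # normal scores of all observations
  W <- sum(phis[-seq_len(n1)])                    # sum of the scores of the Y's
  v <- n1 * n2 / (n * (n - 1)) * sum(phis^2)      # null variance (10.5.8)
  z <- W / sqrt(v)
  c(W = W, z = z, p = 2 * (1 - pnorm(abs(z))))    # two-sided p-value
}

ns_test(x, y)                 # original data
x_out <- x; x_out[5] <- 95.5  # replace the fifth X observation by the outlier
ns_test(x_out, y)             # data with the outlier
```

The first call reproduces the values 3.727011, 1.483559, and 0.137926 shown above, and the second call gives the standardized statistic and p-value quoted for the normal scores analysis of the changed data.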


Example 10.5.4 (Sign Scores). For our final example, suppose that the random 
errors $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ have a Laplace distribution. Consider the convenient 
form $f_Z(z) = 2^{-1}\exp\{-|z|\}$. Then $f_Z'(z) = -2^{-1}\,\text{sgn}(z)\exp\{-|z|\}$ and, hence, 
$-f_Z'(F_Z^{-1}(u))/f_Z(F_Z^{-1}(u)) = \text{sgn}(F_Z^{-1}(u))$. But $F_Z^{-1}(u) > 0$ if and only if $u > 1/2$. The 
optimal score function is 

$$\varphi_S(u) = \text{sgn}\left(u - \frac{1}{2}\right), \tag{10.5.34}$$

which is easily shown to be standardized. The corresponding process is 

$$W_S(\Delta) = \sum_{j=1}^{n_2} \text{sgn}\left[R(Y_j - \Delta) - \frac{n+1}{2}\right]. \tag{10.5.35}$$

Because of the signs, this test statistic can be written in a simpler form, which is 
often called Mood's test; see Exercise 10.5.13. 
We can also obtain the associated estimator in closed form. The estimator solves 
the equation 

$$\sum_{j=1}^{n_2} \text{sgn}\left[R(Y_j - \Delta) - \frac{n+1}{2}\right] \approx 0. \tag{10.5.36}$$

For this equation, we rank the variables 

$$\{X_1, \ldots, X_{n_1}, Y_1 - \Delta, \ldots, Y_{n_2} - \Delta\}.$$

Because ranks, though, are invariant to a constant shift, we obtain the same ranks 
if we rank the variables 

$$X_1 - \text{med}\{X_i\}, \ldots, X_{n_1} - \text{med}\{X_i\}, Y_1 - \Delta - \text{med}\{X_i\}, \ldots, Y_{n_2} - \Delta - \text{med}\{X_i\}.$$

Therefore, the solution to equation (10.5.36) is easily seen to be 

$$\hat{\Delta}_S = \text{med}\{Y_j\} - \text{med}\{X_i\}. \tag{10.5.37}$$
Other examples are given in the exercises. 
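
As a check on the closed form (10.5.37), the R sketch below evaluates the sign scores process (10.5.35) just below and just above the difference of the sample medians, using the data of Table 10.5.1. The function name `W_S`, the variable `delta_S`, and the offset 0.01 are assumptions chosen for illustration.

```r
x <- c(51.9, 56.9, 45.2, 52.3, 59.5, 41.4, 46.4, 45.1, 53.9, 42.9,
       41.5, 55.2, 32.9, 54.0, 45.0)
y <- c(59.2, 49.1, 54.4, 47.0, 55.9, 34.9, 62.2, 41.6, 59.3, 32.7,
       72.1, 43.8, 56.8, 76.7, 60.3)

W_S <- function(delta) {
  n <- length(x) + length(y)
  r <- rank(c(x, y - delta))[-seq_along(x)]     # ranks of Y_j - delta
  sum(sign(r - (n + 1) / 2))                    # sign scores process (10.5.35)
}

delta_S <- median(y) - median(x)                # closed-form estimate (10.5.37)
c(estimate = delta_S, below = W_S(delta_S - 0.01), above = W_S(delta_S + 0.01))
```

The process changes sign as $\Delta$ passes through the estimate, which is the sense in which the difference of medians solves (10.5.36).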


EXERCISES 


10.5.1. In this section, as discussed above expression (10.5.2), the scores $a_\varphi(i)$ are 
generated by the standardized score function $\varphi(u)$; that is, $\int_0^1 \varphi(u)\,du = 0$ and 
$\int_0^1 \varphi^2(u)\,du = 1$. Suppose that $\psi(u)$ is a square-integrable function defined on the 
interval $(0,1)$. Consider the score function defined by 

$$\varphi(u) = \frac{\psi(u) - \overline{\psi}}{\sqrt{\int_0^1 [\psi(v) - \overline{\psi}]^2\,dv}},$$

where $\overline{\psi} = \int_0^1 \psi(v)\,dv$. Show that $\varphi(u)$ is a standardized score function. 


10.5.2. Complete the derivation of the null variance of the test statistic $W_\varphi$ by 
showing the second term in expression (10.5.7) is true. 

Hint: Use the fact that under $H_0$, for $j \ne j'$, the pair of ranks $(R(Y_j), R(Y_{j'}))$ is 
uniformly distributed on the pairs of integers $(i, i')$, $i, i' = 1,2,\ldots,n$, $i \ne i'$. 


10.5.3. For the Wilcoxon score function $\varphi(u) = \sqrt{12}\,[u - (1/2)]$, obtain the value of 
$s_a$. Then show that the $\text{Var}_{H_0}(W_\varphi)$ given in expression (10.5.8) is the same (except 
for standardization) as the variance of the MWW statistic of Section 10.4. 


10.5.4. Recall that the scores have been standardized so that $\int_0^1 \varphi^2(u)\,du = 1$. 
Use this and a Riemann sum to show that $n^{-1}s_a^2 \to 1$, where $s_a^2$ is defined in 
expression (10.5.6). 


10.5.5. Show that the normal scores, (10.5.29), derived in Example 10.5.1 are 
standardized; that is, $\int_0^1 \varphi_N(u)\,du = 0$ and $\int_0^1 \varphi_N^2(u)\,du = 1$. 


10.5.6. In Theorem 10.5.1, show that the minimum value of $W_\varphi(\Delta)$ is given by 
$\sum_{j=1}^{n_2} a_\varphi(j)$ and that it is nonpositive. 


10.5.7. Show that $E_\Delta[W_\varphi(0)] = E_0[W_\varphi(-\Delta)]$. 


10.5.8. Consider the hypotheses (10.4.4). Suppose we select the score function 
$\varphi(u)$ and the corresponding test based on $W_\varphi$. Suppose we want to determine the 
sample size $n = n_1 + n_2$ for this test of significance level $\alpha$ to detect the alternative 
$\Delta^*$ with approximate power $\gamma^*$. Assuming that the sample sizes $n_1$ and $n_2$ are the 
same, show that 

$$n \approx \left(\frac{(z_\alpha - z_{\gamma^*})\,2\tau_\varphi}{\Delta^*}\right)^2. \tag{10.5.38}$$

10.5.9. In the context of this section, show the following invariances: 


(a) Show that the parameter $\tau_\varphi$, (10.5.24), is a scale functional as defined in 
Exercise 10.1.4. 

(b) Show that part (a) implies that the efficacy, (10.5.20), is invariant to the 
location and varies inversely with scale. 


(c) Suppose $Z$ is a scale and location transformation of a random variable $X$; i.e., 
$Z = a(X - b)$, where $a > 0$ and $-\infty < b < \infty$. Show that $I(f_Z) = a^{-2}I(f_X)$. 


10.5.10. Consider the scale parameter $\tau_\varphi$, (10.5.24), when normal scores are used; 
i.e., $\varphi(u) = \Phi^{-1}(u)$. Suppose we are sampling from a $N(\mu, \sigma^2)$ distribution. Show 
that $\tau_\varphi = \sigma$. 


10.5.11. In the context of Example 10.5.2, obtain the results in expression (10.5.32). 


10.5.12. Let the scores $a(i)$ be generated by $a_\varphi(i) = \varphi[i/(n + 1)]$, for $i = 1,\ldots,n$, 
where $\int_0^1 \varphi(u)\,du = 0$ and $\int_0^1 \varphi^2(u)\,du = 1$. Using Riemann sums, with subintervals 
of equal length, of the integrals $\int_0^1 \varphi(u)\,du$ and $\int_0^1 \varphi^2(u)\,du$, show that $\sum_{i=1}^{n} a(i) \approx 0$ 
and $\sum_{i=1}^{n} a^2(i) \approx n$. 
10.5.13. Consider the sign scores test procedure discussed in Example 10.5.4. 

(a) Show that $W_S = 2W_S^* - n_2$, where $W_S^* = \#_j\{R(Y_j) > \frac{n+1}{2}\}$. Hence $W_S^*$ is 
an equivalent test statistic. Find the null mean and variance of $W_S$. 


(b) Show that $W_S^* = \#_j\{Y_j > \theta^*\}$, where $\theta^*$ is the combined sample median. 


(c) Suppose $n$ is even. Letting $W_{XS}^* = \#_i\{X_i > \theta^*\}$, show that we can table $W_S^*$ 
in the following $2 \times 2$ contingency table with all margins fixed: 


 | Y | X | 
No. items > $\theta^*$ | $W_S^*$ | $W_{XS}^*$ | $n/2$ 
No. items < $\theta^*$ | $n_2 - W_S^*$ | $n_1 - W_{XS}^*$ | $n/2$ 
 | $n_2$ | $n_1$ | $n$ 


Show that the usual $\chi^2$ goodness-of-fit is the same as $Z_S^2$, where $Z_S$ is the 
standardized z-test based on $W_S$. This is often called Mood's median test; 
see Example 10.5.4. 


10.5.14. Recall the data discussed in Example 10.5.3. 

(a) Obtain the contingency table described in Exercise 10.5.13. 


(b) Obtain the $\chi^2$ goodness-of-fit test statistic associated with the table and use 
it to test at level 0.05 the hypotheses $H_0 : \Delta = 0$ versus $H_1 : \Delta \ne 0$. 


(c) Obtain the point estimate of $\Delta$ given in expression (10.5.37). 


10.5.15. Optimal signed-rank based methods also exist for the one-sample problem. 
In this exercise, we briefly discuss these methods. Let $X_1, X_2, \ldots, X_n$ follow the 
location model 

$$X_i = \theta + e_i, \tag{10.5.39}$$

where $e_1, e_2, \ldots, e_n$ are iid with pdf $f(x)$, which is symmetric about 0; i.e., $f(-x) = f(x)$. 


(a) Show that under symmetry the optimal two-sample score function (10.5.26) 
satisfies 

$$\varphi_f(1 - u) = -\varphi_f(u), \quad 0 < u < 1; \tag{10.5.40}$$

that is, $\varphi_f(u)$ is an odd function about $\frac{1}{2}$. Show that a function satisfying 
(10.5.40) is 0 at $u = \frac{1}{2}$. 


(b) For a two-sample score function $\varphi(u)$ that is odd about $\frac{1}{2}$, define the function 
$\varphi^+(u) = \varphi[(u+1)/2]$, i.e., the top half of $\varphi(u)$. Note that the domain of $\varphi^+(u)$ 
is the interval $(0,1)$. Show that $\varphi^+(u) \ge 0$, provided $\varphi(u)$ is nondecreasing. 


(c) Assume for the remainder of the problem that $\varphi^+(u)$ is nonnegative and 
nondecreasing on the interval $(0,1)$. Define the scores $a^+(i) = \varphi^+[i/(n + 1)]$, 
$i = 1,2,\ldots,n$, and the corresponding statistic 

$$W_{\varphi^+} = \sum_{i=1}^{n} \text{sgn}(X_i)\,a^+(R|X_i|). \tag{10.5.41}$$

Show that $W_{\varphi^+}$ reduces to a linear function of the signed-rank test statistic 
(10.3.2) if $\varphi(u) = 2u - 1$. 


(d) Show that $W_{\varphi^+}$ reduces to a linear function of the sign test statistic (10.2.3) 
if $\varphi(u) = \text{sgn}(2u - 1)$. 

Note: Suppose Model (10.5.39) is true and we take $\varphi(u) = \varphi_f(u)$, where 
$\varphi_f(u)$ is given by (10.5.26). If we choose $\varphi^+(u) = \varphi[(u+1)/2]$ to generate the 
signed-rank scores, then it can be shown that the corresponding test statistic 
$W_{\varphi^+}$ is optimal among all signed-rank tests. 


(e) Consider the hypotheses 

$$H_0 : \theta = 0 \ \text{ versus } \ H_1 : \theta > 0.$$

Our decision rule for the statistic $W_{\varphi^+}$ is to reject $H_0$ in favor of $H_1$ if $W_{\varphi^+} \ge k$, 
for some $k$. Write $W_{\varphi^+}$ in terms of the anti-ranks, (10.3.5). Show that $W_{\varphi^+}$ 
is distribution-free under $H_0$. 


(f) Determine the mean and variance of $W_{\varphi^+}$ under $H_0$. 


(g) Assuming that, when properly standardized, the null distribution is 
asymptotically normal, determine the asymptotic test. 


10.6 *Adaptive Procedures 


In the last section, we presented fully efficient rank-based procedures for testing and 
estimation. As with mle methods, though, the underlying form of the distribution 
must be known in order to select the optimal rank score function. In practice, 
often the underlying distribution is not known. In this case, we could select a score 
function, such as the Wilcoxon, which is fairly efficient for moderate- to heavy-tailed 
error distributions. Or if the distribution of the errors is thought to be quite close 
to a normal distribution, then the normal scores would be a proper choice. Suppose 
we use a technique that bases the score selection on the data. These techniques are 
called adaptive procedures. Such a procedure could attempt to estimate the score 
function; see, for example, Naranjo and McKean (1997). However, large data sets 
are often needed for these. There are other adaptive procedures that attempt to 
select a score from a finite class of scores based on some criteria. In this section, we 
look at an adaptive testing procedure that retains the distribution-free property. 


Frequently, an investigator is tempted to evaluate several test statistics associated 
with a single hypothesis and then use the one statistic that best supports his 
or her position, usually rejection. Obviously, this type of procedure changes the 
actual significance level of the test from the nominal $\alpha$ that is used. However, there 
is a way in which the investigator can first look at the data and then select a test 
statistic without changing this significance level. For illustration, suppose there are 
three possible test statistics, $W_1, W_2$, and $W_3$, of the hypothesis $H_0$ with respective 
critical regions $C_1, C_2$, and $C_3$ such that $P(W_i \in C_i; H_0) = \alpha$, $i = 1,2,3$. Moreover, 
suppose that a statistic $Q$, based upon the same data, selects one and only one of 
the statistics $W_1, W_2, W_3$, and that $W$ is then used to test $H_0$. For example, we 
choose to use the test statistic $W_i$ if $Q \in D_i$, $i = 1,2,3$, where the events defined 
by $D_1, D_2$, and $D_3$ are mutually exclusive and exhaustive. Now if $Q$ and each $W_i$ 
are independent when $H_0$ is true, then the probability of rejection, using the entire 
procedure (selecting and testing), is, under $H_0$, 

$$\begin{aligned}
&P_{H_0}(Q \in D_1, W_1 \in C_1) + P_{H_0}(Q \in D_2, W_2 \in C_2) + P_{H_0}(Q \in D_3, W_3 \in C_3) \\
&\quad = P_{H_0}(Q \in D_1)P_{H_0}(W_1 \in C_1) + P_{H_0}(Q \in D_2)P_{H_0}(W_2 \in C_2) \\
&\qquad + P_{H_0}(Q \in D_3)P_{H_0}(W_3 \in C_3) \\
&\quad = \alpha[P_{H_0}(Q \in D_1) + P_{H_0}(Q \in D_2) + P_{H_0}(Q \in D_3)] = \alpha.
\end{aligned}$$

That is, the procedure of selecting $W_i$ using an independent statistic $Q$ and then 
constructing a test of significance level $\alpha$ with the statistic $W_i$ has overall significance 
level $\alpha$. 


Of course, the important element in this procedure is the ability to 
find a selector $Q$ that is independent of each test statistic $W$. This can frequently be 
done by using the fact that complete sufficient statistics for the parameters, given by 
$H_0$, are independent of every statistic whose distribution is free of those parameters. 
For illustration, if independent random samples of sizes $n_1$ and $n_2$ arise from two 
normal distributions with respective means $\mu_1$ and $\mu_2$ and common variance $\sigma^2$, 
then the complete sufficient statistics $\overline{X}$, $\overline{Y}$, and 

$$V = \sum_{i=1}^{n_1}(X_i - \overline{X})^2 + \sum_{i=1}^{n_2}(Y_i - \overline{Y})^2$$

for $\mu_1, \mu_2$, and $\sigma^2$ are independent of every statistic whose distribution is free of 
$\mu_1, \mu_2$, and $\sigma^2$, such as the statistics 

$$\frac{\sum_{i=1}^{n_1}(X_i - \overline{X})^2}{\sum_{i=1}^{n_2}(Y_i - \overline{Y})^2}, \quad \frac{\sum_{i=1}^{n_1}|X_i - \text{median}(X_i)|}{\sum_{i=1}^{n_2}|Y_i - \text{median}(Y_i)|}, \quad \frac{\text{range}(X_1, X_2, \ldots, X_{n_1})}{\text{range}(Y_1, Y_2, \ldots, Y_{n_2})}.$$

Thus, in general, we would hope to be able to find a selector $Q$ that is a function 
of the complete sufficient statistics for the parameters, under $H_0$, so that it is 
independent of the test statistic. 

It is particularly interesting to note that it is relatively easy to use this technique 
in nonparametric methods by using the independence result based upon complete 
sufficient statistics for parameters. For the situations here, we must find complete 
sufficient statistics for a cdf, $F$, of the continuous type. In Chapter 7, it is shown 
that the order statistics $Y_1 < Y_2 < \cdots < Y_n$ of a random sample of size $n$ from a 
distribution of the continuous type with pdf $F'(x) = f(x)$ are sufficient statistics 
for the "parameter" $f$ (or $F$). Moreover, if the family of distributions contains all 
probability density functions of the continuous type, the family of joint probability 
density functions of $Y_1, Y_2, \ldots, Y_n$ is also complete. That is, the order statistics 
$Y_1, Y_2, \ldots, Y_n$ are complete sufficient statistics for the parameters $f$ (or $F$). 

Accordingly, our selector $Q$ is based upon those complete sufficient statistics, the 
order statistics under $H_0$. This allows us to independently choose a distribution-free 
test appropriate for this type of underlying distribution, and thus increase the 
power of our test. 
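
The independence argument can be illustrated by a small simulation. In the R sketch below, a selector computed from the order statistics of the combined sample chooses between the Wilcoxon and sign scores tests, and the overall rejection rate under $H_0$ is compared with the mixture of the two individual rejection rates. The selector `Q`, its cutoff 5, the $t_3$ error distribution, the sample sizes, and the use of asymptotic critical values are all hypothetical choices made for illustration; they are not the selectors of the example that follows.

```r
set.seed(10)
n1 <- 15; n2 <- 15; n <- n1 + n2; nsim <- 4000

z_stat <- function(r_y, a) {                       # standardized rank-score statistic
  sum(a[r_y]) / sqrt(n1 * n2 / (n * (n - 1)) * sum(a^2))
}
a_wil  <- sqrt(12) * (seq_len(n) / (n + 1) - 0.5)  # Wilcoxon scores
a_sign <- sign(seq_len(n) / (n + 1) - 0.5)         # sign scores
crit   <- qnorm(0.95)                              # asymptotic level 0.05 cutoff

sims <- t(replicate(nsim, {
  v   <- rt(n, df = 3)                             # combined sample under H0
  r_y <- rank(v)[(n1 + 1):n]                       # ranks of the Y's
  s   <- sort(v)                                   # order statistics of the combined sample
  Q   <- (s[n] - s[1]) / (s[23] - s[8])            # hypothetical tail-weight selector
  c(heavy = Q > 5,
    rej_wil  = z_stat(r_y, a_wil)  >= crit,
    rej_sign = z_stat(r_y, a_sign) >= crit)
}))

heavy    <- sims[, "heavy"]
adaptive <- ifelse(heavy, sims[, "rej_sign"], sims[, "rej_wil"])
overall  <- mean(adaptive)                         # rejection rate of the whole procedure
mixture  <- mean(!heavy) * mean(sims[, "rej_wil"]) +
            mean(heavy)  * mean(sims[, "rej_sign"])
c(overall = overall, mixture = mixture)
```

Because the selector depends only on the order statistics and each test depends only on the ranks, the two printed rates should agree up to simulation error. With these small samples the sign scores statistic is quite discrete, so its asymptotic test need not have level exactly 0.05, and the overall rate reflects that.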

A statistical test that maintains the significance level close to a desired signif- 
icance level a for a wide variety of underlying distributions with good (not neces- 
sarily the best for any one type of distribution) power for all these distributions is 
described as being robust. As an illustration, the pooled t-test (Student’s t) used to 
test the equality of the means of two normal distributions is quite robust provided 
that the underlying distributions are rather close to normal ones with common vari- 
ance. However, if the class of distributions includes those that are not too close to 
normal ones, such as contaminated normal distributions, the test based upon t is 
not robust; the significance level is not maintained and the power of the t-test can 
be quite low for heavy-tailed distributions. As a matter of fact, the test based on 
the Mann-Whitney-Wilcoxon statistic (Section 10.4) is a much more robust test 
than that based upon t if the class of distributions includes those with heavy tails. 
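
The contrast between the two tests can be explored with a small simulation. The following R sketch is only an illustration under assumed settings: the contamination rate (25%), the contaminating standard deviation (10), the sample sizes, the shift, the seed, and the helper name rcn are all choices made here and are not taken from the text. It estimates the power of the pooled t-test and of the Mann-Whitney-Wilcoxon test against a one-sided shift alternative when both samples come from a contaminated normal distribution.

```r
set.seed(106)                        # arbitrary seed, for reproducibility only

# Hypothetical helper: contaminated normal draws, N(0,1) with probability
# 1 - eps and N(0, sd_c^2) with probability eps.
rcn <- function(n, eps = 0.25, sd_c = 10) {
  rnorm(n) * ifelse(runif(n) < eps, sd_c, 1)
}

nsim <- 500; n1 <- 15; n2 <- 15; shift <- 1   # assumed simulation settings

# Each replication records whether each level 0.05 test rejects H0.
rej <- replicate(nsim, {
  x <- rcn(n1)
  y <- rcn(n2) + shift
  c(t   = t.test(y, x, alternative = "greater", var.equal = TRUE)$p.value < 0.05,
    mww = wilcox.test(y, x, alternative = "greater")$p.value < 0.05)
})

power <- rowMeans(rej)   # estimated power of each test at this shift
power
```

Under these assumed settings one would expect the estimated power of the Wilcoxon test to exceed that of the t-test; calling rcn with eps = 0 gives the uncontaminated normal case for comparison.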

In the following example, we illustrate a robust, adaptive, distribution-free pro- 
cedure in the setting of the two-sample problem. 


Example 10.6.1. Let $X_1, X_2, \ldots, X_{n_1}$ be a random sample from a 
continuous-type distribution with cdf $F(x)$ and let $Y_1, Y_2, \ldots, Y_{n_2}$ be a random sample from a 
distribution with cdf $F(x - \Delta)$. Let $n = n_1 + n_2$ denote the combined sample size. 
We test 

$$ H_0: \Delta = 0 \text{ versus } H_1: \Delta > 0, $$


by using one of four distribution-free statistics, one being the Wilcoxon and the 
other three being modifications of the Wilcoxon. In particular, the test statistics 
are 

$$ W_i = \sum_{j=1}^{n_2} a_i[R(Y_j)], \quad i = 1, 2, 3, 4, \tag{10.6.1} $$

where 
$$ a_i(j) = \varphi_i[j/(n+1)], $$

and the four functions are displayed in Figure 10.6.1. The score function $\varphi_1(u)$ 
is the Wilcoxon. The score function $\varphi_2(u)$ is the sign score function. The score 
function $\varphi_3(u)$ is good for short-tailed distributions, and $\varphi_4(u)$ is good for long, 
right-skewed distributions with shift alternatives. 


Figure 10.6.1: Plots of the score functions $\varphi_1(u)$, $\varphi_2(u)$, $\varphi_3(u)$, and $\varphi_4(u)$. 


We combine the two samples into one, denoting the order statistics of the 
combined sample by $V_1 < V_2 < \cdots < V_n$. These are complete sufficient statistics for 
$F(x)$ under the null hypothesis. For $i = 1, \ldots, 4$, the test statistic $W_i$ is distribution 
free under $H_0$ and, in particular, the distribution of $W_i$ does not depend on $F(x)$. 
Therefore, each $W_i$ is independent of $V_1, V_2, \ldots, V_n$. We use a pair of selector 
statistics $(Q_1, Q_2)$, which are functions of $V_1, V_2, \ldots, V_n$, and hence are also independent 
of each $W_i$. The first is 
$$ Q_1 = \frac{\bar{U}_{.05} - \bar{M}_{.5}}{\bar{M}_{.5} - \bar{L}_{.05}}, \tag{10.6.2} $$


where $\bar{U}_{.05}$, $\bar{M}_{.5}$, and $\bar{L}_{.05}$ are the averages of the largest 5% of the Vs, the middle 
50% of the Vs, and the smallest 5% of the Vs, respectively. If $Q_1$ is large (say 2 
or more), then the right tail of the distribution seems longer than the left tail; that 
is, there is an indication that the distribution is skewed to the right. On the other 
hand, if $Q_1 < 1/2$, the sample indicates that the distribution may be skewed to the 
left. The second selector statistic is 


$$ Q_2 = \frac{\bar{U}_{.05} - \bar{L}_{.05}}{\bar{U}_{.5} - \bar{L}_{.5}}. \tag{10.6.3} $$


Large values of $Q_2$ indicate that the distribution is heavy-tailed, while small values 
indicate that the distribution is light-tailed. Rules are needed for score selection, 
and here we make use of the benchmarks proposed in an article by Hogg et al. 
(1975). These rules are tabulated below, along with their benchmarks: 

Benchmark | Distribution Indicated | Score Selected 
$Q_2 > 7$ | Heavy-tailed symmetric | $\varphi_2$ 
$Q_1 > 2$ and $Q_2 < 7$ | Right-skewed | $\varphi_4$ 
$Q_1 \le 2$ and $Q_2 \le 2$ | Light-tailed symmetric | $\varphi_3$ 
Elsewhere | Moderate heavy-tailed | $\varphi_1$ 
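
The selection rule can be sketched in R as follows. This is a simplified illustration and not the hogg.test implementation: the function names hogg_selectors and select_score are hypothetical, the 5% and 50% groups are formed by rounding group sizes up to whole observations (an assumption; fractional weighting of observations is not attempted), and the averages in the denominator of $Q_2$ are taken to be those of the largest and smallest 50% of the Vs. The mapping from benchmarks to scores follows the table above together with Exercise 10.6.3(b).

```r
# Hypothetical helper: selector statistics Q1 and Q2 for a combined sample v.
hogg_selectors <- function(v) {
  v <- sort(v)
  n <- length(v)
  k05 <- ceiling(0.05 * n)             # assumed size of each 5% tail
  k50 <- ceiling(0.50 * n)             # assumed size of each half
  L05 <- mean(v[1:k05])                # average of the smallest 5%
  U05 <- mean(v[(n - k05 + 1):n])      # average of the largest 5%
  L50 <- mean(v[1:k50])                # average of the smallest 50%
  U50 <- mean(v[(n - k50 + 1):n])      # average of the largest 50%
  lo <- floor(0.25 * n) + 1            # middle 50% of the order statistics
  hi <- n - floor(0.25 * n)
  M50 <- mean(v[lo:hi])
  c(Q1 = (U05 - M50) / (M50 - L05), Q2 = (U05 - L05) / (U50 - L50))
}

# Hypothetical helper: apply the benchmarks to pick a score function.
select_score <- function(q) {
  if (q[["Q2"]] > 7) {
    "phi2"                             # heavy-tailed symmetric: sign scores
  } else if (q[["Q1"]] > 2 && q[["Q2"]] < 7) {
    "phi4"                             # right-skewed
  } else if (q[["Q1"]] <= 2 && q[["Q2"]] <= 2) {
    "phi3"                             # light-tailed symmetric
  } else {
    "phi1"                             # elsewhere: Wilcoxon scores
  }
}

v <- 1:20                              # made-up, evenly spaced combined sample
q <- hogg_selectors(v)
q
select_score(q)
```

Because the selectors depend on the combined sample only through its order statistics, the choice made by select_score is independent of each $W_i$ under the null hypothesis, which is the point of the construction.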


Hogg et al. (1975) performed a Monte Carlo power study of this adaptive proce- 
dure over a number of distributions with different kurtosis and skewness coefficients. 
In the study, both the adaptive procedure and the Wilcoxon test maintain their $\alpha$ 
level over the distributions, but the Student t does not. Moreover, the Wilcoxon test 
has better power than the t-test, as the distribution deviates much from the normal 
(kurtosis = 3 and skewness = 0), but the adaptive procedure is much better than 
the Wilcoxon for the short-tailed distributions, the very heavy-tailed distributions, 
and the highly skewed distributions that are considered in the study. 


Remark 10.6.1 (Computation for the Adaptive Procedure). An R implementation 
of Hogg’s adaptive procedure as discussed in Example 10.6.1 can be found in the R 
package npsm developed by Kloke and McKean (2014); see their Section 3.6. The 
R function is hogg.test. For illustration, consider the normal data discussed in 
Example 10.5.3. Here are the code and results: 

library(npsm); load("examp1053.rda"); hogg.test(y,x) 

Scores Selected: Wilcoxon; p.value 0.11984 
Hence, for this data, Hogg’s procedure selected Wilcoxon scores. As another ex- 
ample, consider the waterwheel data given in Example 10.4.1. In this case the 
computation results in: 

load("waterwheel.rda"); hogg.test(grp2,grp1) 

Scores Selected: bent; p.value 0.63494 
The selected score is the bent score, which is the score function $\varphi_4(u)$ in Hogg’s 
procedure. The boxplot of the combined samples indicates that the data are 
right-skewed, which suggests that the score selection is appropriate. ■ 


The adaptive distribution-free procedure that we have discussed is for testing. 
Suppose we have a location model and were interested in estimating the shift in 
locations $\Delta$. For example, if the true $F$ is a normal cdf, then a good choice for 
the estimator of $\Delta$ would be the estimator based on the normal scores procedure 
discussed in Example 10.5.1. The estimators, though, are not distribution free and, 
hence, the above reasoning does not hold. Also, the combined sample observations 
$X_1, \ldots, X_{n_1}, Y_1, \ldots, Y_{n_2}$ are not identically distributed. There are adaptive 
procedures based on residuals $X_1, \ldots, X_{n_1}, Y_1 - \hat{\Delta}, \ldots, Y_{n_2} - \hat{\Delta}$, where $\hat{\Delta}$ is an initial 
estimator of $\Delta$; see page 237 of Hettmansperger and McKean (2011) for discussion 
and Section 7.6 of Kloke and McKean (2014) for an R implementation. 


EXERCISES 


10.6.1. In Exercises 10.6.2 and 10.6.3, the student is asked to apply the adaptive 
procedure described in Example 10.6.1 to real data sets. The hypotheses of interest 
are 

$$ H_0: \Delta = 0 \text{ versus } H_1: \Delta > 0, $$

where $\Delta = \mu_Y - \mu_X$. The four distribution-free test statistics are 

$$ W_i = \sum_{j=1}^{n_2} a_i[R(Y_j)], \quad i = 1, 2, 3, 4, \tag{10.6.4} $$

where 
$$ a_i(j) = \varphi_i[j/(n+1)], $$
and the score functions are given by 

$$ \varphi_1(u) = 2u - 1, \quad 0 < u < 1 $$
$$ \varphi_2(u) = \text{sgn}(2u - 1), \quad 0 < u < 1 $$
$$ \varphi_3(u) = \begin{cases} 4u - 1 & 0 < u \le \frac{1}{4} \\ 0 & \frac{1}{4} < u \le \frac{3}{4} \\ 4u - 3 & \frac{3}{4} < u < 1 \end{cases} $$
$$ \varphi_4(u) = \begin{cases} 4u - (3/2) & 0 < u \le \frac{1}{2} \\ 1/2 & \frac{1}{2} < u < 1. \end{cases} $$

Note that we have adjusted the fourth score $\varphi_4(u)$ in Figure 10.6.1 so that it 
integrates to 0 over the interval (0, 1). 

The theory of Section 10.5 states that, under $H_0$, the distribution of $W_i$ is 
asymptotically normal with mean 0 and variance 

$$ \text{Var}_{H_0}(W_i) = \frac{n_1 n_2}{n - 1} \left[ \frac{1}{n} \sum_{j=1}^{n} a_i^2(j) \right]. $$

Note, however, that the scores have not been standardized so that their squares integrate 
to 1 over the interval (0,1). Hence, do not replace the term in brackets by 1. If 
$n_1 = n_2 = 15$, find $\text{Var}_{H_0}(W_i)$, for $i = 1, \ldots, 4$. 
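
As a starting point for the computations in this exercise, the four score functions can be coded directly. The R sketch below is a hedged illustration: the names phi1 through phi4, approx_integral, and var_W are hypothetical, the integrals are approximated by a midpoint rule on an assumed grid of 100,000 cells, and var_W simply evaluates the variance expression displayed above with the bracketed term computed from the scores.

```r
# The four score functions of the exercise (u may be a vector in (0, 1)).
phi1 <- function(u) 2 * u - 1
phi2 <- function(u) sign(2 * u - 1)
phi3 <- function(u) ifelse(u <= 0.25, 4 * u - 1, ifelse(u <= 0.75, 0, 4 * u - 3))
phi4 <- function(u) ifelse(u <= 0.5, 4 * u - 1.5, 0.5)
scores <- list(phi1 = phi1, phi2 = phi2, phi3 = phi3, phi4 = phi4)

# Midpoint-rule approximation to the integral of a score over (0, 1).
approx_integral <- function(phi, m = 1e5) mean(phi((seq_len(m) - 0.5) / m))

# Each score should integrate to (numerically) zero.
sapply(scores, approx_integral)

# Null variance of W_i: (n1 * n2 / (n - 1)) * [ (1/n) * sum of a_i(j)^2 ].
var_W <- function(phi, n1, n2) {
  n <- n1 + n2
  a <- phi((1:n) / (n + 1))            # scores a_i(j), j = 1, ..., n
  (n1 * n2 / (n - 1)) * mean(a^2)
}
```

A call such as var_W(phi3, 15, 15) then evaluates the variance for a given score and pair of sample sizes.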


10.6.2. Consider the data in Example 10.5.3 and the hypotheses 
$$ H_0: \Delta = 0 \text{ versus } H_1: \Delta > 0, $$

where $\Delta = \mu_Y - \mu_X$. Apply the adaptive procedure described in Example 10.6.1 
with the tests defined in Exercise 10.6.1 to test these hypotheses. Obtain the p-value 
of the test. 

10.6.3. Let $F(x)$ be a distribution function of a distribution of the continuous type 
that is symmetric about its median $\theta$. We wish to test $H_0: \theta = 0$ against $H_1: \theta > 0$. 
Use the fact that the $2n$ values, $X_i$ and $-X_i$, $i = 1, 2, \ldots, n$, after ordering, are 
complete sufficient statistics for $F$, provided that $H_0$ is true. 


(a) As in Exercise 10.5.15, determine the one-sample signed-rank test statistics 
corresponding to the two-sample score functions $\varphi_1(u)$, $\varphi_2(u)$, and $\varphi_3(u)$ 
defined in Exercise 10.6.1. Use the asymptotic test statistics. Note that these 
score functions are odd about $\frac{1}{2}$; hence, their top halves serve as score 
functions for signed-rank statistics. 


(b) We are assuming symmetric distributions in this problem; hence, we use only 
$Q_2$ as our score selector. If $Q_2 > 7$, then select $\varphi_2(u)$; if $2 < Q_2 \le 7$, then 
select $\varphi_1(u)$; and finally, if $Q_2 \le 2$, then select $\varphi_3(u)$. Construct this adaptive 
distribution-free test. 


(c) Use your adaptive procedure on Darwin’s Zea mays data; see Example 10.3.1. 
Obtain the p-value. 


10.7 Simple Linear Model 


In this section, we consider the simple linear model and briefly develop the rank- 
based procedures for it. 
Suppose the responses $Y_1, Y_2, \ldots, Y_n$ follow the model 

$$ Y_i = \alpha + \beta(x_i - \bar{x}) + \varepsilon_i, \quad i = 1, 2, \ldots, n, \tag{10.7.1} $$
where $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are iid with continuous cdf $F(x)$ and pdf $f(x)$. In this model, the 
variables $x_1, x_2, \ldots, x_n$ are considered fixed. Often $x$ is referred to as a predictor 
of $Y$. Also, the centering, using $\bar{x}$, is for convenience (without loss of generality) 
and we do not use it in the examples of this section. The parameter $\beta$ is the slope 
parameter, which is the expected change in $Y$ (provided expectations exist) when 
$x$ increases by one unit. A natural null hypothesis is 

$$ H_0: \beta = 0 \text{ versus } H_1: \beta \ne 0. \tag{10.7.2} $$

Under $H_0$, the distribution of $Y$ is free of $x$. 

In Chapter 3 of Hettmansperger and McKean (2011), rank-based procedures 
for linear models are presented from a geometric point of view; see also Exercises 
10.9.11-10.9.12 of Section 10.9. Here, it is easier to present a development which 
parallels the preceding sections. Hence we introduce a rank test of $H_0$ and then 
invert the test to estimate $\beta$. Before doing this, though, we present an example that 
shows that the two-sample location problem of Section 10.4 is a regression problem. 


Example 10.7.1. As in Section 10.4, let $X_1, X_2, \ldots, X_{n_1}$ be a random sample from 
a distribution with a continuous cdf $F(x - \alpha)$, where $\alpha$ is a location parameter. 
Let $Y_1, Y_2, \ldots, Y_{n_2}$ be a random sample with cdf $F(x - \alpha - \Delta)$. Hence $\Delta$ is the 
shift between the cdfs of $X_i$ and $Y_j$. Redefine the observations as $Z_i = X_i$, for 
$i = 1, \ldots, n_1$, and $Z_{n_1+i} = Y_i$, for $i = 1, \ldots, n_2$, where $n = n_1 + n_2$. Let $c_i$ be 
0 or 1 depending on whether $1 \le i \le n_1$ or $n_1 + 1 \le i \le n$. Then we can write the 
two-sample location models as 

$$ Z_i = \alpha + \Delta c_i + \varepsilon_i, \tag{10.7.3} $$

where $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are iid with cdf $F(x)$. Hence the shift in locations is the slope 
parameter from this viewpoint. ■ 
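
The regression view of the two-sample problem is easy to check numerically in the least squares case, where the fitted slope on the 0/1 indicator equals the difference in sample means. The R sketch below uses made-up data; the sample sizes, the seed, and the shift of 1.5 are assumptions for illustration only, and least squares is used here simply because its slope has a closed form, not because it is the rank-based fit of this section.

```r
set.seed(1071)                      # arbitrary seed, made-up data
n1 <- 8; n2 <- 10
x <- rnorm(n1)                      # first sample
y <- rnorm(n2) + 1.5                # second sample, shifted by an assumed 1.5

z  <- c(x, y)                       # combined responses Z_1, ..., Z_n
ci <- rep(c(0, 1), c(n1, n2))       # indicator c_i of the second sample

fit <- lm(z ~ ci)                   # least squares fit of Z on the indicator
slope_ls   <- coef(fit)[["ci"]]     # estimated slope
diff_means <- mean(y) - mean(x)     # usual estimate of the shift
c(slope = slope_ls, diff_means = diff_means)
```

The two printed values agree, which reflects the identification of the shift $\Delta$ with the slope parameter in Model (10.7.3).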


Suppose the regression model (10.7.1) holds and, further, that $H_0$ is true. Then 
we would expect that $Y_i$ and $x_i - \bar{x}$ are not related and, in particular, that they 
are uncorrelated. Hence one could consider $\sum_{i=1}^{n} (x_i - \bar{x}) Y_i$ as a test statistic. As 
Exercise 9.6.11 of Chapter 9 shows, if we additionally assume that the random 
errors $\varepsilon_i$ are normally distributed, this test statistic, properly standardized, is the 
likelihood ratio test statistic. Reasoning in the same way, for a specified score 
function we would expect that $a_\varphi(R(Y_i))$ and $x_i - \bar{x}$ are uncorrelated, under $H_0$. 
Therefore, consider the test statistic 

$$ T_\varphi = \sum_{i=1}^{n} (x_i - \bar{x}) a_\varphi(R(Y_i)), \tag{10.7.4} $$

where $R(Y_i)$ denotes the rank of $Y_i$ among $Y_1, \ldots, Y_n$ and $a_\varphi(i) = \varphi(i/(n+1))$ for a 
nondecreasing score function $\varphi(u)$ that is standardized, so that $\int_0^1 \varphi(u)\,du = 0$ and 
$\int_0^1 \varphi^2(u)\,du = 1$. Values of $T_\varphi$ close to 0 indicate $H_0$ is true. 

Assume $H_0$ is true. Then $Y_1, \ldots, Y_n$ are iid random variables. Hence any 
permutation of the integers $\{1, 2, \ldots, n\}$ is equilikely to be the ranks of $Y_1, \ldots, Y_n$. So 
the distribution of $T_\varphi$ is free of $F(x)$. Note that the distribution does depend on 
$x_1, x_2, \ldots, x_n$. Thus, tables of the distribution are not available, although with 
high-speed computing, this distribution can be generated. Because $R(Y_i)$ is uniformly 
distributed on the integers $\{1, 2, \ldots, n\}$, it is easy to show that the null expectation 
of $T_\varphi$ is zero. The null variance follows that of $W_\varphi$ of Section 10.5, so we have left 
the details for Exercise 10.7.4. To summarize, the null moments are given by 

$$ E_{H_0}(T_\varphi) = 0 \text{ and } \text{Var}_{H_0}(T_\varphi) = \frac{1}{n-1} s_a^2 \sum_{i=1}^{n} (x_i - \bar{x})^2, \tag{10.7.5} $$

where $s_a^2$ is the mean sum of the squares of the scores (10.5.6). Also, it can be shown 
that the test statistic is asymptotically normal. Therefore, an asymptotic level $\alpha$ 
decision rule for the hypotheses (10.7.2) with the two-sided alternative is given by 

$$ \text{Reject } H_0 \text{ in favor of } H_1 \text{ if } |z| = \left| \frac{T_\varphi}{\sqrt{\text{Var}_{H_0}(T_\varphi)}} \right| \ge z_{\alpha/2}. \tag{10.7.6} $$
The associated process is given by 
$$ T_\varphi(\beta) = \sum_{i=1}^{n} (x_i - \bar{x}) a_\varphi(R(Y_i - x_i \beta)). \tag{10.7.7} $$
Hence the corresponding estimate of $\beta$ is given by $\hat{\beta}_\varphi$, which solves the estimating 
equations 

$$ T_\varphi(\hat{\beta}_\varphi) \approx 0. \tag{10.7.8} $$

Similar to Theorem 10.5.1, it can be shown that $T_\varphi(\beta)$ is a decreasing step function 
of $\beta$ that steps down at each sample slope $(Y_j - Y_i)/(x_j - x_i)$, for $i \ne j$. Thus the 
estimate exists. It cannot be obtained in closed form, but simple iterative techniques 
can be used to find the solution. In the regression problem, though, prediction of $Y$ 
is often of interest, which also requires an estimate of $\alpha$. Notice that such an estimate 
can be obtained as a location estimate based on residuals. This is discussed in some 
detail in Section 3.5.2 of Hettmansperger and McKean (2011). For our purposes, 
we consider the median of the residuals; that is, we estimate $\alpha$ as 

$$ \hat{\alpha} = \text{med}\{Y_i - \hat{\beta}_\varphi (x_i - \bar{x})\}. \tag{10.7.9} $$
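
One simple iterative technique for the estimating equations is a bracketing root search on the step function. The R sketch below is an illustration only and is not how Rfit computes its fit: the function names wilcoxon_score, T_phi, and rank_slope are hypothetical, Wilcoxon scores are used in the standardized form $\varphi(u) = \sqrt{12}(u - 1/2)$, the data are made up, and the search interval is taken (as an assumption that relies on the step-function property) to be the range of the pairwise sample slopes widened by one unit on each side.

```r
# Wilcoxon scores a(i) = phi(i / (n + 1)) with phi(u) = sqrt(12) * (u - 1/2).
wilcoxon_score <- function(r, n) sqrt(12) * (r / (n + 1) - 0.5)

# The process T(beta) = sum (x_i - xbar) * a(R(y_i - x_i * beta)).
T_phi <- function(beta, x, y) {
  e <- y - x * beta
  sum((x - mean(x)) * wilcoxon_score(rank(e), length(e)))
}

# Slope estimate: locate the sign change of the decreasing step function.
rank_slope <- function(x, y) {
  s <- outer(y, y, "-") / outer(x, x, "-")   # pairwise sample slopes
  s <- s[is.finite(s)]                       # drop the 0/0 diagonal entries
  uniroot(T_phi, interval = range(s) + c(-1, 1),
          x = x, y = y, tol = 1e-10)$root
}

set.seed(107)                                # arbitrary seed, made-up data
x <- 1:20
y <- 5 + 2 * x + rt(20, df = 2)              # heavy-tailed errors

b_hat <- rank_slope(x, y)
a_hat <- median(y - b_hat * (x - mean(x)))   # median of residuals, centered x
c(alpha = a_hat, beta = b_hat)
```

Because $T_\varphi(\beta)$ jumps across zero rather than passing through it continuously, the value returned is a point at which the process changes sign, which is the sense of the approximate equality in the estimating equations.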


Remark 10.7.1 (Computation). The Wilcoxon estimates of slope and intercept are 
computed by several packages. We recommend the CRAN package Rfit developed 
by Kloke and McKean (2012). Chapter 4 of the book by Kloke and McKean (2014) 
discusses the use of Rfit for the simple regression model (10.7.1). Rfit has code 
for many score functions, including the Wilcoxon scores, normal scores, as well as 
scores appropriate for skewed error distributions. The computations in this section 
are performed by Rfit. Also, the Minitab command rregr obtains the Wilcoxon 
fit. Terpstra and McKean (2005) have written a collection of R functions, ww, which 
obtains the fit using Wilcoxon scores. ■ 


Example 10.7.2 (Telephone Data). Consider the regression data discussed in 
Exercise 9.6.3. Recall that the responses (y) for this data set are the numbers of 
telephone calls (tens of millions) made in Belgium for the years 1950-1973, while 
time in years serves as the predictor variable (x). The data are plotted in Figure 
10.7.1. The data are in the file telephone.rda. For this example, we use Wilcoxon 
scores to fit Model (10.7.1). The code and partial results (including the plot with 
overlaid fits) are: 

library(Rfit) # provides rfit() 
fitls <- lm(numcall~year); fitrb <- rfit(numcall~year) 
fitls$coef; fitrb$coef # Result -26.0, 0.504; -7.1, 0.145 
plot(numcall~year,xlab="Year",ylab="Number of calls") 
abline(fitls); abline(fitrb,lty=2) 
legend(50,15,c("LS-Fit","Wilcoxon-Fit"),lty=c(1,2)) 

Thus, the Wilcoxon fitted value is $\hat{Y}_{\varphi,i} = -7.1 + 0.145 x_i$, which is plotted in Figure 
10.7.1. The least squares fit $\hat{Y}_{LS,i} = -26.0 + 0.504 x_i$ is also plotted. Note that the 
Wilcoxon fit is much less sensitive to the outliers than the least squares fit. 

The outliers in this data set were recording errors; see page 25 of Rousseeuw 
and Leroy (1987) for more discussion. ■ 


Similar to Lemma 10.2.1, a translation property holds for the process $T(\beta)$ given 
by 
$$ E_\beta[T(0)] = E_0[T(-\beta)]; \tag{10.7.10} $$
see Exercise 10.7.2. Further, as Exercise 10.7.5 shows, this property implies that 
the power curves for the one-sided tests of $H_0: \beta = 0$ are monotone, assuring the 
unbiasedness of the tests based on $T_\varphi$. 

Figure 10.7.1: Plot of telephone data, Example 10.7.2, overlaid with Wilcoxon 
and LS fits. 


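We can now derive the efficacy of the process. Let $\mu_T(\beta) = E_\beta[T(0)]$ and 
$\sigma_T^2(0) = \text{Var}_0[T(0)]$. Expression (10.7.5) gives the result for $\sigma_T^2(0)$. Recall that for 
the mean $\mu_T(\beta)$, we need its derivative at 0. We freely use the relationship between 
rankings and the empirical cdf and then approximate this empirical cdf with the 
true cdf. Hence 

$$ \begin{aligned} \mu_T(\beta) = E_\beta[T(0)] = E_0[T(-\beta)] &= \sum_{i=1}^{n} (x_i - \bar{x}) E_0[a_\varphi(R(Y_i + x_i\beta))] \\ &= \sum_{i=1}^{n} (x_i - \bar{x}) E_0\left[\varphi\left(\frac{n \hat{F}_n(Y_i + x_i\beta)}{n+1}\right)\right] \\ &\approx \sum_{i=1}^{n} (x_i - \bar{x}) E_0[\varphi(F(Y_i + x_i\beta))] \\ &= \sum_{i=1}^{n} (x_i - \bar{x}) \int_{-\infty}^{\infty} \varphi(F(y + x_i\beta)) f(y)\,dy. \end{aligned} \tag{10.7.11} $$

Differentiating this last expression, we have 

$$ \mu_T'(\beta) = \sum_{i=1}^{n} (x_i - \bar{x}) x_i \int_{-\infty}^{\infty} \varphi'(F(y + x_i\beta)) f(y + x_i\beta) f(y)\,dy, $$
which yields 
$$ \mu_T'(0) = \sum_{i=1}^{n} (x_i - \bar{x})^2 \int_{-\infty}^{\infty} \varphi'(F(y)) f^2(y)\,dy. \tag{10.7.12} $$
We need one assumption on the $x_1, x_2, \ldots, x_n$; namely, $n^{-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \to \sigma_x^2$, 
where $0 < \sigma_x^2 < \infty$. Recall that $(n-1)^{-1} s_a^2 \to 1$. Therefore, the efficacy of the 
process $T(\beta)$ is given by 

$$ \begin{aligned} c_T = \lim_{n\to\infty} \frac{\mu_T'(0)}{\sqrt{n}\,\sigma_T(0)} &= \lim_{n\to\infty} \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 \int_{-\infty}^{\infty} \varphi'(F(y)) f^2(y)\,dy}{\sqrt{n}\,\sqrt{(n-1)^{-1} s_a^2}\,\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \\ &= \sigma_x \int_{-\infty}^{\infty} \varphi'(F(y)) f^2(y)\,dy. \end{aligned} \tag{10.7.13} $$

Using this, an asymptotic power lemma can be derived for the test based on $T_\varphi$; 
see expression (10.7.17) of Exercise 10.7.6. Based on this, it can be shown that the 
asymptotic distribution of the estimator $\hat{\beta}_\varphi$ is given by 

$$ \hat{\beta}_\varphi \text{ has an approximate } N\left(\beta, \frac{\tau_\varphi^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right) \text{ distribution,} \tag{10.7.14} $$

where the scale parameter $\tau_\varphi$ is $\tau_\varphi = \left(\int_{-\infty}^{\infty} \varphi'(F(y)) f^2(y)\,dy\right)^{-1}$. Koul et al. (1987) 
developed a consistent estimator of the scale parameter $\tau_\varphi$, which is the default 
estimate in the package Rfit. This can be used to compute a confidence interval 
for the slope parameter, as illustrated in Example 10.7.3. 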


Remark 10.7.2. The least squares (LS) estimates for Model (10.7.1) were discussed 
in Section 9.6 in the case that the random errors $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are iid with a $N(0, \sigma^2)$ 
distribution. In general, for Model (10.7.1), the asymptotic distribution of the LS 
estimator of $\beta$, say $\hat{\beta}_{LS}$, is: 

$$ \hat{\beta}_{LS} \text{ has an approximate } N\left(\beta, \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right) \text{ distribution,} \tag{10.7.15} $$

where $\sigma^2$ is the variance of $\varepsilon_i$. Based on (10.7.14) and (10.7.15), it follows that the 
ARE between the rank-based and LS estimators is given by 

$$ \text{ARE}(\hat{\beta}_\varphi, \hat{\beta}_{LS}) = \frac{\sigma^2}{\tau_\varphi^2}. \tag{10.7.16} $$

Hence, if Wilcoxon scores are used, this ARE is the same as the ARE between the 
Wilcoxon and t-procedures in the one- and two-sample location models. ■ 
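
The ARE in (10.7.16) can be evaluated numerically for a specific error distribution. The R sketch below is an illustration under stated assumptions: the errors are taken to be standard normal (so $\sigma^2 = 1$), Wilcoxon scores are used in the standardized form $\varphi(u) = \sqrt{12}(u - 1/2)$, so that $\varphi'(u) = \sqrt{12}$, and the integral in the definition of $\tau_\varphi$ is computed with the base R function integrate.

```r
# Integral of f^2 for the standard normal density (assumed error distribution).
int_f2 <- integrate(function(y) dnorm(y)^2, lower = -Inf, upper = Inf)$value

# Scale parameter tau for Wilcoxon scores: phi'(u) = sqrt(12) is constant,
# so tau = 1 / (sqrt(12) * integral of f^2).
tau <- 1 / (sqrt(12) * int_f2)

sigma2 <- 1                         # variance of the standard normal errors
are <- sigma2 / tau^2               # ARE of the Wilcoxon fit relative to LS
c(tau = tau, ARE = are, three_over_pi = 3 / pi)
```

The computed ARE agrees with $3/\pi$, the familiar value for the Wilcoxon procedure relative to the t-procedure at the normal distribution.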
Example 10.7.3 (Distance of Punts). Rasmussen (1992), page 562, presents a data 
set concerning distance of punts along with several predictors. The actual response 
is the average distance in feet of 10 punts for each of 13 punters. As a predictor, we 
consider the average hang-time in seconds (the time the punted football is in the 
air). The data are in the file punter.rda. Based on the plot (see Exercise 10.7.1), 
the simple linear model seems reasonable as an initial fit. Next is the code and 
partial results of the Wilcoxon fit: 

library(Rfit) # provides rfit() 
fit <- rfit(distance~hangtime); summary(fit) 

            Estimate Std. Error t.value  p.value 
(Intercept)  -18.180     51.201 -0.3551 0.729254 
hangtime      41.010     12.882  3.1834 0.008708 ** 

The second line of the summary table gives the Wilcoxon estimate of the slope 
(41.01) and the standard error of the estimate (12.88). Hence, we predict that the 
football travels an additional 41 feet for each additional second of hang-time. An 
approximate 95% confidence interval for the true slope, using the t-critical value with 11 
degrees of freedom, is (12.66, 69.36). So with approximate confidence of 95% the 
slope differs from 0. 
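
The confidence interval quoted above follows from the summary table by the usual estimate plus or minus critical value times standard error. The short R sketch below reproduces that arithmetic from the reported numbers only; the variable names are hypothetical, and the 11 degrees of freedom are the 13 punters less the two fitted parameters.

```r
est <- 41.010                       # Wilcoxon slope estimate from the summary
se  <- 12.882                       # its reported standard error
df  <- 13 - 2                       # 13 punters, two regression parameters

tcrit <- qt(0.975, df)              # upper 0.025 critical value of t(11)
ci <- est + c(-1, 1) * tcrit * se   # approximate 95% confidence interval
round(ci, 2)
```

The interval excludes 0, in line with the small p-value reported for the slope.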


EXERCISES 
10.7.1. Consider the data on football punts in Example 10.7.3. 


(a) Obtain the scatterplot of distance versus hang-time and overlay the Wilcoxon 
fit. 


(b) As a second predictor, consider overall strength of the kicker, which is in the 
variable strength. Obtain the scatterplot of distance versus strength and 
overlay the Wilcoxon fit. What is the meaning of the slope parameter for this 
predictor? Answer using a 95% confidence interval for the slope. 


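10.7.2. Establish expression (10.7.10). To do this, note first that the expression is 
the same as 

$$ E_\beta\left[\sum_{i=1}^{n} (x_i - \bar{x}) a_\varphi(R(Y_i))\right] = E_0\left[\sum_{i=1}^{n} (x_i - \bar{x}) a_\varphi(R(Y_i + x_i\beta))\right]. $$

Show that the cdfs of $Y_i$ (under $\beta$) and $Y_i + (x_i - \bar{x})\beta$ (under 0) are the same. 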


10.7.3. Suppose we have a two-sample model given by (10.7.3). Assuming Wilcoxon 
scores, show that the test statistic (10.7.4) is equivalent to the Wilcoxon test statistic 
found in expression (10.4.5). 


10.7.4. Show that the null variance of the test statistic $T_\varphi$ is the value given in 
(10.7.5). 


10.7.5. Show that the translation property (10.7.10) implies that the power curve 
for either one-sided test based on the test statistic $T_\varphi$ of $H_0: \beta = 0$ is monotone. 

10.7.6. Consider the sequence of local alternatives given by the hypotheses 
$$ H_0: \beta = 0 \text{ versus } H_{1n}: \beta = \beta_n = \frac{\beta_1}{\sqrt{n}}, $$
where $\beta_1 > 0$. Let $\gamma(\beta)$ be the power function discussed in Exercise 10.7.5 for an 
asymptotic level $\alpha$ test based on the test statistic $T_\varphi$. Using the mean value theorem 
to approximate $\mu_T(\beta_n)$, sketch a proof of the limit 

$$ \lim_{n\to\infty} \gamma(\beta_n) = 1 - \Phi(z_\alpha - c_T \beta_1). \tag{10.7.17} $$

10.8 Measures of Association 


In the last section, we discussed the simple linear regression model in which the 
random variables, Ys, were the responses or dependent variables, while the xs were 
the independent variables and were thought of as fixed. Regression models occur in 
several ways. In an experimental design, the values of the independent variables are 
prespecified and the responses are observed. Bioassays (dose-response experiments) 
are examples. The doses are fixed and the responses are observed. If the experimen- 
tal design is performed in a controlled environment (for example, all other variables 
are controlled), it may be possible to establish cause and effect between x and Y. 
On the other hand, in observational studies both the xs and Ys are observed. In the 
regression setting, we are still interested in predicting Y in terms of x, but usually 
cause and effect between x and Y are precluded in such studies (other variables 
besides x may be changing). 

In this section, we focus on observational studies but are interested in the 
strength of the association between Y and x. So both X and Y are treated as 
random variables in this section and the underlying distribution of interest is the 
bivariate distribution of the pair (X,Y). We assume that this bivariate distribution 
is continuous with cdf F(x,y) and pdf f(x,y). 

Hence, let (X,Y) be a pair of random variables. A natural null model (baseline 
model) is that there is no relationship between X and Y; that is, the null hypothesis 
is given by $H_0$: X and Y are independent. Alternatives, though, depend on which 
measure of association is of interest. For example, if we are interested in the 
correlation between X and Y, we use the correlation coefficient $\rho$ (Section 9.7) as our 
measure of the association. A two-sided alternative in this case is $H_1: \rho \ne 0$. 
Recall that independence between X and Y implies that $\rho = 0$, but that the converse 
is not true. However, the contrapositive is true; that is, $\rho \ne 0$ implies that X and 
Y are dependent. So, in rejecting $H_0$, we conclude that X and Y are dependent. 
Furthermore, the size of $\rho$ indicates the strength of the correlation between X and 
Y. 


10.8.1 Kendall’s τ 


The first measure of association that we consider in this section is a measure of the 
monotonicity between X and Y. Monotonicity is an easily understood association 
between X and Y. Let $(X_1, Y_1)$ and $(X_2, Y_2)$ be independent pairs with the same 
bivariate distribution (discrete or continuous). We say these pairs are concordant if 
$\text{sgn}\{(X_1 - X_2)(Y_1 - Y_2)\} = 1$ and are discordant if $\text{sgn}\{(X_1 - X_2)(Y_1 - Y_2)\} = -1$. 
The variables X and Y have an increasing relationship if the pairs tend to be 
concordant and a decreasing relationship if the pairs tend to be discordant. A 
measure of this is given by Kendall’s $\tau$, 

$$ \tau = P[\text{sgn}\{(X_1 - X_2)(Y_1 - Y_2)\} = 1] - P[\text{sgn}\{(X_1 - X_2)(Y_1 - Y_2)\} = -1]. \tag{10.8.1} $$

As Exercise 10.8.1 shows, $-1 \le \tau \le 1$. Positive values of $\tau$ indicate increasing 
monotonicity, negative values indicate decreasing monotonicity, and $\tau = 0$ reflects 
neither. Furthermore, as the following theorem shows, if X and Y are independent, 
then $\tau = 0$. 


Theorem 10.8.1. Let $(X_1, Y_1)$ and $(X_2, Y_2)$ be independent pairs of observations of 
(X,Y), which has a continuous bivariate distribution. If X and Y are independent, 
then $\tau = 0$. 


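Proof: Let $(X_1, Y_1)$ and $(X_2, Y_2)$ be independent pairs of observations with the same 
continuous bivariate distribution as (X,Y). Because the cdf is continuous, the sign 
function is either $-1$ or 1. By independence, we have 

$$ \begin{aligned} P[\text{sgn}\{(X_1 - X_2)(Y_1 - Y_2)\} = 1] &= P[\{X_1 > X_2\} \cap \{Y_1 > Y_2\}] + P[\{X_1 < X_2\} \cap \{Y_1 < Y_2\}] \\ &= P[X_1 > X_2] P[Y_1 > Y_2] + P[X_1 < X_2] P[Y_1 < Y_2] \\ &= \left(\tfrac{1}{2}\right)^2 + \left(\tfrac{1}{2}\right)^2 = \tfrac{1}{2}. \end{aligned} $$

Likewise, $P[\text{sgn}\{(X_1 - X_2)(Y_1 - Y_2)\} = -1] = \frac{1}{2}$; hence, $\tau = 0$. ■ 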


Relative to Kendall’s $\tau$ as the measure of association, the two-sided hypotheses 
of interest here are 
$$ H_0: \tau = 0 \text{ versus } H_1: \tau \ne 0. \tag{10.8.2} $$

As Exercise 10.8.1 shows, the converse of Theorem 10.8.1 is false. However, the 
contrapositive is true; i.e., $\tau \ne 0$ implies that X and Y are dependent. As with the 
correlation coefficient, in rejecting $H_0$, we conclude that X and Y are dependent. 

Kendall’s $\tau$ has a simple unbiased estimator. Let $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ 
be a random sample from the cdf F(x,y). Define the statistic 

$$ K = \binom{n}{2}^{-1} \sum_{i<j} \text{sgn}\{(X_i - X_j)(Y_i - Y_j)\}. \tag{10.8.3} $$
Note that for all $i \ne j$, the pairs $(X_i, Y_i)$ and $(X_j, Y_j)$ are identically distributed. 
Thus $E(K) = \binom{n}{2}^{-1} \binom{n}{2} E[\text{sgn}\{(X_1 - X_2)(Y_1 - Y_2)\}] = \tau$. 

In order to use K as a test statistic of the hypotheses (10.8.2), we need its 
distribution under the null hypothesis. Under $H_0$, $\tau = 0$, so $E_{H_0}(K) = 0$. The 
null variance of K is given by expression (10.8.6); see, for instance, page 205 of 
Hettmansperger (1984). If all pairs $(X_i, Y_i), (X_j, Y_j)$ of the sample are concordant 
then $K = 1$, indicating a strictly increasing monotone relationship. On the other 
hand, if all pairs are discordant then $K = -1$. Thus the range of K is contained in 
the interval $[-1, 1]$. Also, the summands in expression (10.8.3) are either $+1$ or $-1$. From 
the proof of Theorem 10.8.1, the probability that a summand is 1 is 1/2, which does 
not depend on the underlying distribution. Hence the statistic K is 
distribution-free under $H_0$. The null distribution of K is symmetric about 0. This is easily seen 
from the fact that for each concordant pair there is an obvious discordant pair (just 
reverse an inequality on the Ys) and the fact that concordant and discordant pairs 
are equilikely under $H_0$. Also, it can be shown that K is asymptotically normal 
under $H_0$. We summarize these results, without proof, in a theorem. 


Theorem 10.8.2. Let $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ be a random sample on the 
bivariate random vector (X,Y) with continuous cdf F(x,y). Under the null 
hypothesis of independence between X and Y, i.e., $F(x,y) = F_X(x) F_Y(y)$, for all (x,y) 
in the support of (X,Y), the test statistic K satisfies the following properties: 

K is distribution free with a symmetric pmf (10.8.4) 
$E_{H_0}[K] = 0$ (10.8.5) 
$\text{Var}_{H_0}(K) = \frac{2}{9} \frac{2n+5}{n(n-1)}$ (10.8.6) 
$\frac{K}{\sqrt{\text{Var}_{H_0}(K)}}$ has an asymptotic N(0,1) distribution. (10.8.7) 


Most statistical computing packages compute Kendall’s $\tau$. For instance, the R 
function cor.test(x,y,method="kendall",exact=TRUE) obtains K and the test 
discussed above when x and y are the vectors of the X and Y observations, 
respectively. The computation of the p-value is with the exact distribution. We illustrate 
this test in the next example. 

Based on the asymptotic distribution, a large sample level $\alpha$ test for the 
hypotheses (10.8.2) is to reject $H_0$ if $|Z_K| \ge z_{\alpha/2}$, where 

$$ Z_K = \frac{K}{\sqrt{2(2n+5)/[9n(n-1)]}}. \tag{10.8.8} $$

Example 10.8.1 (Olympic Race Times). Table 10.8.1 displays the winning times 
for two races in the Olympics beginning with the 1896 Olympics through the 1980 
Olympics. The data were taken from Hettmansperger (1984) and can be found in 
the data set olym1500mara.rda. The times in seconds are for the 1500 m and the 
marathon. The entries in the table for the marathon race are the actual times minus 
2 hours. In Exercise 10.8.2 the reader is asked to create a scatterplot of the times 
for the two races. The plot shows a strongly increasing monotone trend with one 
obvious outlier (1968 Olympics). The following R code computes Kendall’s $\tau$. We 
have summarized the results with the estimate of Kendall’s $\tau$ and the p-value of the 
test of no association. This p-value is based on the exact distribution. 

cor.test(m1500,marathon,method="kendall",exact=TRUE) 

Table 10.8.1: Data for Example 10.8.1 


Year | 1500 m | Marathon* | Year | 1500 m | Marathon* 
1896 373.2 3530 1936 227.8 1759 
1900 246.0 3585 1948 229.8 2092 
1904 245.4 5333 1952 225.2 1383 
1906 252.0 3084 1956 221.2 1500 
1908 243.4 3318 1960 215.6 916 
1912 236.8 2215 1964 218.1 731 
1920 241.8 1956 1968 214.9 1226 
1924 233.6 2483 1972 216.3 740 
1928 233.2 1977 1976 219.2 595 
1932 231.2 1896 1980 218.4 663 


* Actual marathon times are 2 hours + entry. 


p-value = 3.319e-06; estimates: tau 0.6947368 
The test results show strong evidence to reject the hypothesis of the independence 
of the winning times of the races. ■ 
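
To see how the statistic K in (10.8.3) and the standardized statistic in (10.8.8) are formed, they can be computed directly from their definitions. The R sketch below uses a small made-up data set without ties (not the Olympic data), and the function name kendall_K is hypothetical; the result is compared with the value returned by the base R function cor with method="kendall".

```r
# K of (10.8.3): average of sgn{(X_i - X_j)(Y_i - Y_j)} over all pairs i < j.
kendall_K <- function(x, y) {
  n <- length(x)
  idx <- combn(n, 2)                          # columns are the pairs (i, j)
  s <- sign((x[idx[1, ]] - x[idx[2, ]]) * (y[idx[1, ]] - y[idx[2, ]]))
  sum(s) / choose(n, 2)
}

x <- c(2.1, 3.4, 1.7, 5.0, 4.2, 6.3)          # made-up data, no ties
y <- c(1.0, 2.9, 2.2, 4.8, 3.1, 5.5)
n <- length(x)

K <- kendall_K(x, y)
z_K <- K / sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))   # asymptotic statistic
c(K = K, cor_kendall = cor(x, y, method = "kendall"), z_K = z_K)
```

For a sample this small the exact distribution would be preferred over the normal approximation; the standardized value is shown only to illustrate the formula.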


10.8.2 Spearman’s Rho 


As above, assume that $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ is a random sample from 
a bivariate continuous cdf F(x,y). The population correlation coefficient $\rho$ is a 
measure of linearity between X and Y. The usual estimate is the sample correlation 
coefficient given by 

$$ r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}; \tag{10.8.9} $$

see Section 9.7. A simple rank analog is to replace $X_i$ by $R(X_i)$, where $R(X_i)$ 
denotes the rank of $X_i$ among $X_1, \ldots, X_n$, and likewise $Y_i$ by $R(Y_i)$, where $R(Y_i)$ 
denotes the rank of $Y_i$ among $Y_1, \ldots, Y_n$. Upon making this substitution, the 
denominator of the above ratio is a constant. This results in the statistic 

$$ r_S = \frac{\sum_{i=1}^{n} \left(R(X_i) - \frac{n+1}{2}\right)\left(R(Y_i) - \frac{n+1}{2}\right)}{n(n^2 - 1)/12}, \tag{10.8.10} $$

which is called Spearman’s rho. The statistic $r_S$ is a correlation coefficient, so 
the inequality $-1 \le r_S \le 1$ is true. Further, as the following theorem shows, 
independence implies that the mean of $r_S$ is 0. 


Theorem 10.8.3. Suppose $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ is a sample on (X,Y), 
where (X,Y) has the continuous cdf F(x,y). If X and Y are independent, then 
$E(r_S) = 0$. 

Proof: Under independence, $X_i$ and $Y_j$ are independent for all $i$ and $j$; hence, 
in particular, $R(X_i)$ is independent of $R(Y_i)$. Furthermore, $R(X_i)$ is uniformly 
distributed on the integers $\{1, 2, \ldots, n\}$. Therefore, $E(R(X_i)) = (n+1)/2$, which 
leads to the result. ■ 


Thus the measure of association $r_S$ can be used to test the null 
hypothesis of independence similar to Kendall’s K. Under independence, because the 
$X_i$s are a random sample, the random vector $(R(X_1), \ldots, R(X_n))$ is equilikely 
to assume any permutation of the integers $\{1, 2, \ldots, n\}$ and, likewise, the vector 
of the ranks of the $Y_i$s. Furthermore, under independence, the random vector 
$[R(X_1), \ldots, R(X_n), R(Y_1), \ldots, R(Y_n)]$ is equilikely to assume any of the $(n!)^2$ 
vectors $(i_1, i_2, \ldots, i_n, j_1, j_2, \ldots, j_n)$, where $(i_1, i_2, \ldots, i_n)$ and $(j_1, j_2, \ldots, j_n)$ are 
permutations of the integers $\{1, 2, \ldots, n\}$. Hence, under independence, the statistic 
$r_S$ is distribution-free. The distribution is discrete and tables of it can be found, 
for instance, in Hollander and Wolfe (1999). Similar to Kendall’s statistic K, the 
distribution is symmetric about zero and it has an asymptotic normal distribution 
with asymptotic variance $1/(n-1)$; see Exercise 10.8.7 for a proof of the null 
variance of $r_S$. A large sample level $\alpha$ test is to reject independence between X and Y 
if $|z_S| \ge z_{\alpha/2}$, where $z_S = \sqrt{n-1}\, r_S$. We record these results in a theorem, without 
proof. 


Theorem 10.8.4. Let $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ be a random sample on the 
bivariate random vector (X,Y) with continuous cdf F(x,y). Under the null 
hypothesis of independence between X and Y, i.e., $F(x,y) = F_X(x) F_Y(y)$, for all (x,y) 
in the support of (X,Y), the test statistic $r_S$ satisfies the following properties: 

$r_S$ is distribution-free, symmetrically distributed about 0 (10.8.11) 
$E_{H_0}[r_S] = 0$ (10.8.12) 
$\text{Var}_{H_0}(r_S) = \frac{1}{n-1}$ (10.8.13) 
$\frac{r_S}{\sqrt{\text{Var}_{H_0}(r_S)}}$ is asymptotically N(0,1). (10.8.14) 


Example 10.8.2 (Example 10.8.1, Continued). For the data in Example 10.8.1,
the R code for the analysis based on Spearman's $\rho$ is:

cor.test(m1500, marathon, method="spearman")

p-value = 2.021e-06; sample estimates: rho 0.9052632

The result is highly significant. For comparison, the value of the asymptotic test
statistic is $z_S = 0.905\sqrt{19} = 3.94$, with a two-sided p-value of 0.00008;
so, the results are quite similar. ■
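The large-sample test above can be sketched in a few lines of R. This is only an illustration: the function name spearman_asymptotic and the data vectors are hypothetical (they are not the race times of Example 10.8.1), and with a sample this small the normal approximation is rough.

```r
# Large-sample Spearman test: z_S = sqrt(n - 1) * r_S, referred to N(0, 1).
spearman_asymptotic <- function(x, y) {
  n <- length(x)
  r_s <- cor(x, y, method = "spearman")
  z_s <- sqrt(n - 1) * r_s
  p_value <- 2 * pnorm(-abs(z_s))   # two-sided p-value
  list(r_s = r_s, z_s = z_s, p_value = p_value)
}

# Hypothetical data, for illustration only.
x <- c(3.1, 4.7, 1.2, 5.9, 2.4, 6.3, 0.8, 7.5)
y <- c(2.9, 5.1, 1.0, 5.2, 3.3, 7.0, 1.4, 6.1)
res <- spearman_asymptotic(x, y)
```

The exact test reported by cor.test and this approximation need not agree closely for small n; the sketch is meant only to show how the statistic is formed.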


If the samples have a strictly increasing monotone relationship, then it is easy to
see that $r_S = 1$; while if they have a strictly decreasing monotone relationship, then
$r_S = -1$. Like Kendall's $K$ statistic, $r_S$ is an estimate of a population parameter,
but, except for when $X$ and $Y$ are independent, it is a more complicated expression
than $\tau$. It can be shown (see Kendall, 1962) that

$$E(r_S) = \frac{3}{n+1}\,[\tau + (n-2)(2\gamma - 1)], \tag{10.8.15}$$

where $\gamma = P[(X_2 - X_1)(Y_3 - Y_1) > 0]$. For large $n$, $E(r_S) \approx 6(\gamma - 1/2)$, which is a
harder parameter to interpret than the measure of concordance $\tau$.

Spearman’s rho is based on Wilcoxon scores; hence, it can easily be extended to 
other rank score functions. Some of these measures are discussed in the exercises. 


Remark 10.8.1 (Confidence Intervals). Distribution-free confidence intervals for
Kendall's $\tau$ exist; see Section 8.5 of Hollander and Wolfe (1999). As outlined in
Exercise 10.8.6, it is easy to construct percentile bootstrap confidence intervals for
both parameters. The R function cor.boot.ci in the CRAN package npsm obtains
such confidence intervals; see Section 4.8 of Kloke and McKean (2014) for discussion.
It also requires the CRAN package boot developed by Canty and Ripley (2017).
We used this function to compute confidence intervals for $\tau$ and $\rho_S$:

library(boot); library(npsm)
cor.boot.ci(m1500, marathon, method="spearman")   # (0.719, 0.955)
cor.boot.ci(m1500, marathon, method="kendall")    # (0.494, 0.845)


EXERCISES 
10.8.1. Show that Kendall's $\tau$ satisfies the inequality $-1 \le \tau \le 1$.


10.8.2. Consider Example 10.8.1. Let Y = winning times of the 1500 m race for a 
particular year and let X = winning times of the marathon for that year. Obtain 
a scatterplot of Y versus X, and determine the outlying point. 


10.8.3. Consider the last exercise as a regression problem. Suppose we are inter- 
ested in predicting the 1500 m winning time based on the marathon winning time. 
Assume a simple linear model and obtain the least squares and Wilcoxon (Section 
10.7) fits of the data. Overlay the fits on the scatterplot obtained in Exercise 10.8.2. 
Comment on the fits. What does the slope parameter mean in this problem? 


10.8.4. With regard to Exercise 10.8.3, a more interesting prediction problem is
the prediction of the winning time of either race based on year.


(a) Make a scatterplot of the winning 1500 m race time versus year. Assume a 
simple linear model (does the assumption make sense?) and obtain the least 
squares and Wilcoxon (Section 10.7) fits of the data. Overlay the fits on the 
scatterplot. Comment on the fits. What does the slope parameter mean in this 
problem? Predict the winning time for 1984. How close was your prediction 
to the true winning time? 


(b) Same as part (a), except use the winning time of the marathon for that year. 


10.8.5. Spearman's rho is a rank correlation coefficient based on Wilcoxon scores.
In this exercise we consider a rank correlation coefficient based on a general score
function. Let $(X_1,Y_1),(X_2,Y_2),\ldots,(X_n,Y_n)$ be a random sample from a bivariate
continuous cdf $F(x,y)$. Let $a(i) = \varphi(i/(n+1))$, where $\sum_{i=1}^n a(i) = 0$. In particular,
$\bar{a} = 0$. As in expression (10.5.6), let $s_a^2 = \sum_{i=1}^n a^2(i)$. Consider the rank correlation
coefficient,

$$r_a = \frac{1}{s_a^2}\sum_{i=1}^n a(R(X_i))\,a(R(Y_i)). \tag{10.8.16}$$

(a) Show that $r_a$ is a correlation coefficient on the items
$\{(a[R(X_1)], a[R(Y_1)]), (a[R(X_2)], a[R(Y_2)]), \ldots, (a[R(X_n)], a[R(Y_n)])\}$.


(b) For the score function $\varphi(u) = \sqrt{12}(u - (1/2))$, show that $r_a = r_S$, Spearman's
rho.


(c) Obtain $r_a$ for the sign score function $\varphi(u) = \mathrm{sgn}(u - (1/2))$. Call this rank
correlation coefficient $r_{qc}$. (The subscript qc is obvious from Exercise 10.8.8.)
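As a numerical companion to this exercise (not a proof), the coefficient in (10.8.16) can be computed directly from a score function. The sketch below assumes no ties and scores that sum to zero; the function names and the data are hypothetical.

```r
# General-score rank correlation, following (10.8.16):
# r_a = sum_i a(R(x_i)) * a(R(y_i)) / s_a^2, with a(i) = phi(i / (n + 1)).
rank_cor_scores <- function(x, y, phi) {
  n <- length(x)
  a <- phi(seq_len(n) / (n + 1))   # scores a(1), ..., a(n); assumed to sum to 0
  sum(a[rank(x)] * a[rank(y)]) / sum(a^2)
}

phi_wilcoxon <- function(u) sqrt(12) * (u - 0.5)   # Wilcoxon (linear) scores
phi_sign <- function(u) sign(u - 0.5)              # sign scores

# Hypothetical data without ties.
x <- c(3.1, 4.7, 1.2, 5.9, 2.4, 6.3, 0.8, 7.5)
y <- c(2.9, 5.1, 1.0, 5.2, 3.3, 7.0, 1.4, 6.1)
r_wil <- rank_cor_scores(x, y, phi_wilcoxon)
r_qc <- rank_cor_scores(x, y, phi_sign)
```

On these data r_wil can be compared with cor(x, y, method="spearman"), which is one way to check an answer to part (b) numerically.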


10.8.6. Write an R function that computes a percentile bootstrap confidence
interval for Kendall's $\tau$. Run your function for the data discussed in Example 10.8.1 and
compare your answer with the confidence interval for Kendall's $\tau$ given in Remark
10.8.1.
Note: The following R code obtains resampled vectors of x and y:
n = length(x); ind = 1:n; mat = cbind(x,y); inds = sample(ind,n,replace=TRUE)
mats = mat[inds,]; xs = mats[,1]; ys = mats[,2]


10.8.7. Consider the general score rank correlation coefficient $r_a$ defined in Exercise
10.8.5. Consider the null hypothesis $H_0$: $X$ and $Y$ are independent.


(a) Show that $E_{H_0}(r_a) = 0$.


(b) Based on part (a) and $H_0$, as a first step in obtaining the variance under $H_0$,
show that the following expression is true:

$$\mathrm{Var}_{H_0}(r_a) = \frac{1}{s_a^4}\sum_{i=1}^n\sum_{j=1}^n E_{H_0}[a(R(X_i))a(R(X_j))]\,E_{H_0}[a(R(Y_i))a(R(Y_j))].$$


(c) To determine the expectation in the last expression, consider the two cases
$i = j$ and $i \ne j$. Then using uniformity of the distribution of the ranks, show
that

$$\mathrm{Var}_{H_0}(r_a) = \frac{1}{s_a^4}\,\frac{1}{n-1}\,s_a^4 = \frac{1}{n-1}. \tag{10.8.17}$$


10.8.8. Consider the rank correlation coefficient given by $r_{qc}$ in part (c) of
Exercise 10.8.5. Let $Q_{2X}$ and $Q_{2Y}$ denote the medians of the samples $X_1,\ldots,X_n$ and
$Y_1,\ldots,Y_n$, respectively. Now consider the four quadrants:

$$\begin{aligned}
I &= \{(x,y) : x > Q_{2X},\ y > Q_{2Y}\} \\
II &= \{(x,y) : x < Q_{2X},\ y > Q_{2Y}\} \\
III &= \{(x,y) : x < Q_{2X},\ y < Q_{2Y}\} \\
IV &= \{(x,y) : x > Q_{2X},\ y < Q_{2Y}\}.
\end{aligned}$$

Show essentially that

$$r_{qc} = \frac{1}{n}\{\#(X_i,Y_i) \in I + \#(X_i,Y_i) \in III - \#(X_i,Y_i) \in II - \#(X_i,Y_i) \in IV\}. \tag{10.8.18}$$

Hence, $r_{qc}$ is referred to as the quadrant count correlation coefficient.


10.8.9. Set up the asymptotic test of independence using $r_{qc}$ of the last
exercise. Then use it to test for independence between the 1500 m race times and the
marathon race times of the data in Example 10.8.1.


10.8.10. Obtain the rank correlation coefficient when normal scores are used; that
is, the scores are $a(i) = \Phi^{-1}(i/(n+1))$, $i = 1,\ldots,n$. Call it $r_N$. Set up the
asymptotic test of independence using $r_N$. Then use it to test
for independence between the 1500 m race times and the marathon race times of
the data in Example 10.8.1.


10.8.11. Suppose that the hypothesis $H_0$ concerns the independence of two random
variables $X$ and $Y$. That is, we wish to test $H_0: F(x,y) = F_1(x)F_2(y)$, where $F$, $F_1$,
and $F_2$ are the respective joint and marginal distribution functions of the continuous
type, against all alternatives. Let $(X_1,Y_1),(X_2,Y_2),\ldots,(X_n,Y_n)$ be a random
sample from the joint distribution. Under $H_0$, the order statistics of $X_1,X_2,\ldots,X_n$ and
the order statistics of $Y_1,Y_2,\ldots,Y_n$ are, respectively, complete sufficient statistics
for $F_1$ and $F_2$. Use $r_S$, $r_{qc}$, and $r_N$ to create an adaptive distribution-free test of $H_0$.


Remark 10.8.2. It is interesting to note that in an adaptive procedure it would
be possible to use different score functions for the $X$s and $Y$s. That is, the order
statistics of the $X$ values might suggest one score function and those of the $Y$s
another score function. Under the null hypothesis of independence, the resulting
procedure would produce an $\alpha$ level test. ■


10.9 Robust Concepts 


In this section, we introduce some of the concepts in robust estimation. We intro- 
duce these concepts for the location model discussed in Sections 10.1-10.3 of this 
chapter and then apply them to the simple linear regression model of Section 10.7. 
In a review article, McKean (2004) presents three introductory lectures on robust 
concepts. 


10.9.1 Location Model 


In a few words, we say an estimator is robust if it is not sensitive to outliers in the
data. In this section, we make this more precise for the location model. Suppose
then that $X_1,X_2,\ldots,X_n$ is a random sample which follows the location model as
given in Definition 10.1.2; i.e.,

$$X_i = \theta + \varepsilon_i, \quad i = 1,2,\ldots,n, \tag{10.9.1}$$

where $\theta$ is a location parameter (functional) and $\varepsilon_i$ has cdf $F(t)$ and pdf $f(t)$. Let
$F_X(t)$ and $f_X(t)$ denote the cdf and pdf of $X$, respectively. Then $F_X(t) = F(t-\theta)$
and $f_X(t) = f(t-\theta)$.

To illustrate the robust concepts, we use the location estimators discussed in
Sections 10.1-10.3: the sample mean, the sample median, and the Hodges-Lehmann
estimator. It is convenient to define these estimators in terms of their estimating
equations. The estimating equation of the sample mean is given by

$$\sum_{i=1}^n (X_i - \theta) = 0; \tag{10.9.2}$$

i.e., the solution to this equation is $\hat\theta = \bar{X}$. The estimating equation for the sample
median is given in expression (10.2.34), which, for convenience, we repeat:

$$\sum_{i=1}^n \mathrm{sgn}(X_i - \theta) = 0. \tag{10.9.3}$$

Recall from Section 10.2 that the sample median minimizes the $L_1$-norm. So in
this section, we denote it as $\hat\theta_{L_1} = \mathrm{med}\,X_i$. Finally, the estimating equation for the
Hodges-Lehmann estimator is given by expression (10.4.27). For this section, we
denote the solution to this equation by

$$\hat\theta_{HL} = \mathrm{med}_{i \le j}\left\{\frac{X_i + X_j}{2}\right\}. \tag{10.9.4}$$

Suppose, in general, then that we have a random sample $X_1,X_2,\ldots,X_n$, which
follows the location model (10.9.1) with location parameter $\theta$. Let $\hat\theta$ be an estimator
of $\theta$. Hopefully, $\hat\theta$ is not unduly influenced by an outlier in the sample, that is, a
point that is at a distance from the other points in the sample. For a realization of
the sample, this sensitivity to outliers is easy to measure. We simply add an outlier
to the data set and observe the change in the estimator.

More formally, let $\mathbf{x}_n' = (x_1,x_2,\ldots,x_n)$ be a realization of the sample, let $x$ be
the additional point, and denote the augmented sample by $\mathbf{x}_{n+1}' = (\mathbf{x}_n', x)$. Then a
simple measure is the rate of change in the estimate due to $x$ relative to the mass
of $x$, $(1/(n+1))$; i.e.,

$$S(x;\hat\theta) = \frac{\hat\theta(\mathbf{x}_{n+1}) - \hat\theta(\mathbf{x}_n)}{1/(n+1)}. \tag{10.9.5}$$

This is called the sensitivity curve of the estimate $\hat\theta$.
As examples, consider the sample mean and median. For the sample mean, it is
easy to see that

$$S(x;\bar{X}) = \frac{\bar{x}_{n+1} - \bar{x}_n}{1/(n+1)} = x - \bar{x}_n. \tag{10.9.6}$$
Hence the relative change in the sample mean is a linear function of $x$. Thus, if
$x$ is large, then the change in sample mean is also large. Actually, the change is
unbounded in $x$. Thus the sample mean is quite sensitive to the size of the outlier.
In contrast, consider the sample median in which the sample size $n$ is odd. In
this case, the sample median is $\hat\theta_{L_1,n} = x_{(r)}$, where $r = (n+1)/2$. When the
additional point $x$ is added, the sample size becomes even and the sample median
$\hat\theta_{L_1,n+1}$ is the average of the middle two order statistics. If $x$ varies between these
two order statistics, then there is some change between $\hat\theta_{L_1,n}$ and $\hat\theta_{L_1,n+1}$. But
once $x$ moves beyond these middle two order statistics, there is no change. Hence
$S(x;\hat\theta_{L_1,n})$ is a bounded function of $x$. Therefore, $\hat\theta_{L_1,n}$ is much less sensitive to an
outlier than the sample mean.

Because the Hodges-Lehmann estimator $\hat\theta_{HL}$, (10.9.4), is also a median, its
sensitivity curve is also bounded. Exercise 10.9.2 provides a numerical illustration
of these sensitivity curves.
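The definition (10.9.5) translates almost directly into R. The following is a minimal sketch, assuming a small hypothetical sample (not the data of Exercise 10.9.2); the helper name sens_curve is introduced here only for illustration.

```r
# Sensitivity curve (10.9.5): (n + 1) * (est(c(xn, x)) - est(xn)),
# evaluated at each added point x in the vector `pts`.
sens_curve <- function(xn, pts, est) {
  n <- length(xn)
  sapply(pts, function(pt) (n + 1) * (est(c(xn, pt)) - est(xn)))
}

xn <- c(-2.1, -0.7, 0.3, 1.2, 2.5)   # hypothetical sample with n odd
grid <- seq(-100, 100, by = 10)      # locations of the added point

s_mean <- sens_curve(xn, grid, mean)     # equals grid - mean(xn), see (10.9.6)
s_median <- sens_curve(xn, grid, median) # stays bounded as the point moves out
```

Plotting s_mean and s_median against grid (for example with plot and lines) shows the linear growth for the mean and the flat tails for the median.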


Influence Functions 


One problem with the sensitivity curve is its dependence on the sample. In earlier 
chapters, we compared estimators in terms of their variances which are functions of 
the underlying distribution. This is the type of comparison we want to make here. 

Recall that the location model (10.9.1) is the model of interest, where $F_X(t) =
F(t-\theta)$ is the cdf of $X$ and $F(t)$ is the cdf of $\varepsilon$. As discussed in Section 10.1, the
parameter $\theta$ is a function of the cdf $F_X(x)$. It is convenient, then, to use functional
notation $\theta = T(F_X)$, as in Section 10.1. For example, if $\theta$ is the mean, then $T(F_X)$
is defined as

$$T(F_X) = \int_{-\infty}^{\infty} x\,dF_X(x) = \int_{-\infty}^{\infty} x f_X(x)\,dx, \tag{10.9.7}$$

while if $\theta$ is the median, then $T(F_X)$ is defined as

$$T(F_X) = F_X^{-1}\left(\tfrac{1}{2}\right). \tag{10.9.8}$$

It was shown in Section 10.1 that for a location functional, $T(F_X) = T(F) + \theta$.

Estimating equations (EE) such as those defined in expressions (10.9.2) and 
(10.9.3) are often quite intuitive, for example, based on likelihood equations or 
methods such as least squares. On the other hand, functionals are more of an ab- 
stract concept. But often the estimating equations naturally lead to the functionals. 
We outline this next for the mean and median functionals. 

Let $F_n$ be the empirical distribution function of the realized sample $x_1,x_2,\ldots,x_n$.
That is, $F_n$ is the cdf of the distribution which puts mass $n^{-1}$ on each $x_i$; see (10.1.1).
Note that we can write the estimating equation (10.9.2), which defines the sample
mean, as

$$\sum_{i=1}^n (x_i - \theta)\frac{1}{n} = 0. \tag{10.9.9}$$

This is an expectation using the empirical distribution. Since $F_n \to F_X$ in
probability, it would seem that this expectation converges to

$$\int_{-\infty}^{\infty} [x - T(F_X)] f_X(x)\,dx = 0. \tag{10.9.10}$$

The solution to the above equation is, of course, $T(F_X) = E(X)$.
Likewise, we can write the estimating equation (EE), (10.9.3), which defines the
sample median, as

$$\sum_{i=1}^n \mathrm{sgn}(X_i - \theta)\frac{1}{n} = 0. \tag{10.9.11}$$

The corresponding equation for the functional $\theta = T(F_X)$ is the solution of the
equation

$$\int_{-\infty}^{\infty} \mathrm{sgn}[y - T(F_X)] f_X(y)\,dy = 0. \tag{10.9.12}$$

Note that this can be written as

$$0 = -\int_{-\infty}^{T(F_X)} f_X(y)\,dy + \int_{T(F_X)}^{\infty} f_X(y)\,dy = -F_X[T(F_X)] + 1 - F_X[T(F_X)].$$

Hence $F_X[T(F_X)] = 1/2$ or $T(F_X) = F_X^{-1}(1/2)$. Thus $T(F_X)$ is the median of the
distribution of $X$.

Now we want to consider how a given functional $T(F_X)$ changes relative to some
perturbation. The analog of adding an outlier to $F(t)$ is to consider a point-mass
contamination of the cdf $F_X(t)$ at a point $x$. That is, for $\epsilon > 0$, let

$$F_{x,\epsilon}(t) = (1-\epsilon)F_X(t) + \epsilon\Delta_x(t), \tag{10.9.13}$$

where $\Delta_x(t)$ is the cdf with all its mass at $x$; i.e.,

$$\Delta_x(t) = \begin{cases} 0 & t < x \\ 1 & t \ge x. \end{cases} \tag{10.9.14}$$

The cdf $F_{x,\epsilon}(t)$ is a mixture of two distributions. When sampling from it, $(1-\epsilon)100\%$
of the time an observation is drawn from $F_X(t)$, while $\epsilon 100\%$ of the time $x$ (an
outlier) is drawn. So $x$ has the flavor of the outlier in the sensitivity curve. As
Exercise 10.9.4 shows, $F_{x,\epsilon}(t)$ is in an $\epsilon$ neighborhood of $F_X(t)$; that is, for all $x$,
$|F_{x,\epsilon}(t) - F_X(t)| \le \epsilon$. Hence the functional at $F_{x,\epsilon}(t)$ should also be close to $T(F_X)$.
The concept for functionals, corresponding to the sensitivity curve, is the function

$$IF(x;\hat\theta) = \lim_{\epsilon \to 0} \frac{T(F_{x,\epsilon}) - T(F_X)}{\epsilon}, \tag{10.9.15}$$

provided the limit exists. The function $IF(x;\hat\theta)$ is called the influence function of
the estimator $\hat\theta$ at $x$. As the notation suggests, it can be thought of as a derivative
of the functional $T(F_{x,\epsilon})$ with respect to $\epsilon$ evaluated at 0, and we often determine
it this way. Note that for $\epsilon$ small,

$$T(F_{x,\epsilon}) \approx T(F_X) + \epsilon\,IF(x;\hat\theta);$$

hence, the change of the functional due to point-mass contamination is
approximately directly proportional to the influence function. We want estimators whose
influence functions are not sensitive to outliers. Further, as mentioned above, for
any $x$, $F_{x,\epsilon}(t)$ is close to $F_X(t)$. Hence, at least, the influence function should be a
bounded function of $x$.

Definition 10.9.1. The estimator $\hat\theta$ is said to be robust if $|IF(x;\hat\theta)|$ is bounded
for all $x$.


Hampel (1974) proposed the influence function and discussed its important
properties, a few of which we list below. First, however, we determine the influence
functions of the sample mean and median.

For the sample mean, recall Section 3.4.1 on mixture distributions. The function
$F_{x,\epsilon}(t)$ is the cdf of the random variable $U = I_{1-\epsilon}X + [1 - I_{1-\epsilon}]W$, where $X$, $I_{1-\epsilon}$,
and $W$ are independent random variables, $X$ has cdf $F_X(t)$, $W$ has cdf $\Delta_x(t)$, and
$I_{1-\epsilon}$ is $b(1, 1-\epsilon)$. Hence

$$E(U) = (1-\epsilon)E(X) + \epsilon E(W) = (1-\epsilon)E(X) + \epsilon x.$$

Denote the mean functional by $T_\mu(F_X) = E(X)$. In terms of $T_\mu(F)$, we have just
shown that

$$T_\mu(F_{x,\epsilon}) = (1-\epsilon)T_\mu(F_X) + \epsilon x.$$

Therefore,

$$\frac{\partial T_\mu(F_{x,\epsilon})}{\partial\epsilon} = -T_\mu(F) + x.$$

Hence the influence function of the sample mean is

$$IF(x;\bar{X}) = x - \mu, \tag{10.9.16}$$

where $\mu = E(X)$. The influence function of the sample mean is linear in $x$ and,
hence, is an unbounded function of $x$. Therefore, the sample mean is not a robust
estimator. Another way to derive the influence function is to differentiate implicitly
equation (10.9.10) when this equation is defined for $F_{x,\epsilon}(t)$; see Exercise 10.9.6.


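Example 10.9.1 (Influence Function of the Sample Median). In this example, we
derive the influence function of the sample median, $\hat\theta_{L_1}$. In this case, the functional
is $T_\theta(F) = F^{-1}(1/2)$, i.e., the median of $F$. To determine the influence function, we
first need to determine the functional at the contaminated cdf $F_{x,\epsilon}(t)$, i.e., determine
$F_{x,\epsilon}^{-1}(1/2)$. As shown in Exercise 10.9.8, the inverse of the cdf $F_{x,\epsilon}(t)$ is given by

$$F_{x,\epsilon}^{-1}(u) = \begin{cases} F^{-1}\left(\dfrac{u}{1-\epsilon}\right) & u < F(x) \\[2ex] F^{-1}\left(\dfrac{u-\epsilon}{1-\epsilon}\right) & u \ge F(x), \end{cases} \tag{10.9.17}$$

for $0 < u < 1$. Hence, letting $u = 1/2$, we get

$$T_\theta(F_{x,\epsilon}) = F_{x,\epsilon}^{-1}(1/2) = \begin{cases} F_X^{-1}\left(\dfrac{1/2}{1-\epsilon}\right) & F_X^{-1}\left(\tfrac{1}{2}\right) < x \\[2ex] F_X^{-1}\left(\dfrac{(1/2)-\epsilon}{1-\epsilon}\right) & F_X^{-1}\left(\tfrac{1}{2}\right) > x. \end{cases} \tag{10.9.18}$$

Based on (10.9.18) the partial derivative of $F_{x,\epsilon}^{-1}(1/2)$ with respect to $\epsilon$ is seen to be

$$\frac{\partial T_\theta(F_{x,\epsilon})}{\partial\epsilon} = \begin{cases} \dfrac{(1/2)(1-\epsilon)^{-2}}{f_X[F_X^{-1}((1/2)/(1-\epsilon))]} & F_X^{-1}\left(\tfrac{1}{2}\right) < x \\[2ex] \dfrac{(-1/2)(1-\epsilon)^{-2}}{f_X[F_X^{-1}(\{(1/2)-\epsilon\}/\{1-\epsilon\})]} & F_X^{-1}\left(\tfrac{1}{2}\right) > x. \end{cases} \tag{10.9.19}$$

Evaluating this partial derivative at $\epsilon = 0$, we arrive at the influence function of the
median:

$$IF(x;\hat\theta_{L_1}) = \begin{cases} \dfrac{1}{2f_X(\theta)} & \theta < x \\[2ex] \dfrac{-1}{2f_X(\theta)} & \theta > x \end{cases} \;=\; \frac{\mathrm{sgn}(x-\theta)}{2f_X(\theta)}, \tag{10.9.20}$$

where $\theta$ is the median of $F_X$. Because this influence function is bounded, the sample
median is a robust estimator.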
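The limit in (10.9.15) can be checked numerically for the median. The sketch below assumes $F_X$ is the standard normal cdf (so $\theta = 0$) and uses the closed form (10.9.18) for the median of the contaminated cdf; the function names are hypothetical, and the formula is only used at points $x \ne \theta$ with a very small $\epsilon$.

```r
# Median of the contaminated cdf (1 - eps) * F + eps * Delta_x, via (10.9.18),
# for F = standard normal, so the uncontaminated median is theta = 0.
contaminated_median <- function(x, eps, theta = 0) {
  if (x > theta) qnorm(0.5 / (1 - eps)) else qnorm((0.5 - eps) / (1 - eps))
}

# Finite-difference version of the limit in (10.9.15).
if_numeric <- function(x, eps = 1e-6) {
  (contaminated_median(x, eps) - 0) / eps
}

# Closed form (10.9.20): sgn(x - theta) / (2 f(theta)).
if_exact <- function(x) sign(x) / (2 * dnorm(0))

xs <- c(-50, -2, 0.5, 3, 1000)       # contamination points, all away from theta
approx_vals <- sapply(xs, if_numeric)
exact_vals <- if_exact(xs)
```

The two vectors agree closely no matter how far out the contamination point lies, which is the boundedness that makes the median robust.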


As derived on p. 46 of Hettmansperger and McKean (2011), the influence
function of the Hodges-Lehmann estimator, $\hat\theta_{HL}$, at the point $x$ is given by:

$$IF(x;\hat\theta_{HL}) = \frac{F_X(x) - 1/2}{\int_{-\infty}^{\infty} f_X^2(t)\,dt}. \tag{10.9.21}$$

Since a cdf is bounded, the Hodges-Lehmann estimator is robust.

We now list three useful properties of the influence function of an estimator.
Note that for the sample mean, $E[IF(X;\bar{X})] = E[X] - \mu = 0$. This is true in
general. Let $IF(x) = IF(x;\hat\theta)$ denote the influence function of the estimator $\hat\theta$ with
functional $\theta = T(F_X)$. Then

$$E[IF(X)] = 0, \tag{10.9.22}$$

provided expectations exist; see Huber (1981) for a discussion. Hence, for the second
property, we have

$$\mathrm{Var}[IF(X)] = E[IF^2(X)], \tag{10.9.23}$$

provided the squared expectation exists. A third property of the influence function
is the asymptotic result

$$\sqrt{n}[\hat\theta - \theta] = \frac{1}{\sqrt{n}}\sum_{i=1}^n IF(X_i) + o_p(1). \tag{10.9.24}$$

Assume that the variance (10.9.23) exists; then because $IF(X_1),\ldots,IF(X_n)$ are iid
with finite variance, the simple Central Limit Theorem and (10.9.24) imply that

$$\sqrt{n}[\hat\theta - \theta] \xrightarrow{D} N(0, E[IF^2(X)]). \tag{10.9.25}$$

Thus we can obtain the asymptotic distribution of the estimator from its influence
function. Under general conditions, expression (10.9.24) holds, but often the
verification of the conditions is difficult and the asymptotic distribution can be obtained
more easily in another way; see Huber (1981) for a discussion. In this chapter,
though, we use (10.9.24) to obtain asymptotic distributions of estimators. Suppose
(10.9.24) holds for the estimators $\hat\theta_1$ and $\hat\theta_2$, which are both estimators of the same
functional, say, $\theta$. Then, letting $IF_i$ denote the influence function of $\hat\theta_i$, $i = 1,2$, we
can express the asymptotic relative efficiency between the two estimators as

$$ARE(\hat\theta_1,\hat\theta_2) = \frac{E[IF_2^2(X)]}{E[IF_1^2(X)]}. \tag{10.9.26}$$

As an example, we consider the sample median.

Example 10.9.2 (Asymptotic Distribution of the Sample Median). The influence
function for the sample median $\hat\theta_{L_1}$ is given by (10.9.20). Since $E[\mathrm{sgn}^2(X - \theta)] = 1$,
by expression (10.9.25) the asymptotic distribution of the sample median is

$$\sqrt{n}[\hat\theta_{L_1} - \theta] \xrightarrow{D} N\left(0, [2f_X(\theta)]^{-2}\right),$$

where $\theta$ is the median of the pdf $f_X(t)$. This agrees with the result given in Section
10.2. ■
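Expressions (10.9.23) and (10.9.26) can be evaluated numerically once a distribution is fixed. The sketch below assumes $X$ is standard normal (so the mean and median are both 0) and uses the influence functions (10.9.16), (10.9.20), and (10.9.21); the helper names are hypothetical, and the integrals are approximated with R's integrate.

```r
pdf_x <- dnorm   # assumed model: X ~ N(0, 1)
cdf_x <- pnorm
theta <- 0       # common value of the mean and median under this model

# Integral of f^2, needed for the Hodges-Lehmann influence function (10.9.21).
int_f2 <- integrate(function(t) pdf_x(t)^2, -Inf, Inf)$value

if_mean <- function(x) x - theta
if_median <- function(x) sign(x - theta) / (2 * pdf_x(theta))
if_hl <- function(x) (cdf_x(x) - 0.5) / int_f2

# Asymptotic variance E[IF^2(X)], as in (10.9.23) and (10.9.25).
avar <- function(infl) {
  integrate(function(x) infl(x)^2 * pdf_x(x), -Inf, Inf)$value
}

v_mean <- avar(if_mean)
v_median <- avar(if_median)
v_hl <- avar(if_hl)

# (10.9.26): ARE(est1, est2) = E[IF_2^2] / E[IF_1^2].
are_median_mean <- v_mean / v_median
are_hl_mean <- v_mean / v_hl
```

Under the normal model these ratios should be close to $2/\pi$ and $3/\pi$; replacing pdf_x and cdf_x (and theta, if needed) gives the same comparison for another symmetric model.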


Breakdown Point of an Estimator 


The influence function of an estimator measures the sensitivity of an estimator to 
a single outlier, sometimes called the local sensitivity of the estimator. We next 
discuss a measure of global sensitivity of an estimator. That is, what proportion of 
outliers can an estimator tolerate without completely breaking down? 


To be precise, let $\mathbf{x}_n' = (x_1,x_2,\ldots,x_n)$ be a realization of a sample. Suppose we
corrupt $m$ points of this sample by replacing $x_1,\ldots,x_m$ by $x_1^*,\ldots,x_m^*$, where these
points are large outliers. Let $\mathbf{x}_m = (x_1^*,\ldots,x_m^*,x_{m+1},\ldots,x_n)$ denote the corrupted
sample. Define the bias of the estimator upon corrupting $m$ data points to be

$$\mathrm{bias}(m,\mathbf{x}_n,\hat\theta) = \sup|\hat\theta(\mathbf{x}_m) - \hat\theta(\mathbf{x}_n)|, \tag{10.9.27}$$

where the sup is taken over all possible corrupted samples $\mathbf{x}_m$. If this bias is infinite,
we say that the estimator has broken down. The smallest proportion of corruption
an estimator can tolerate until its breakdown is called its finite sample breakdown
point. More precisely, if

$$\epsilon_n^* = \min_m\{m/n : \mathrm{bias}(m,\mathbf{x}_n,\hat\theta) = \infty\}, \tag{10.9.28}$$

then $\epsilon_n^*$ is called the finite sample breakdown point of $\hat\theta$. If the limit

$$\epsilon_n^* \to \epsilon^* \tag{10.9.29}$$

exists, we call $\epsilon^*$ the breakdown point of $\hat\theta$.

To determine the breakdown point of the sample mean, suppose we corrupt
one data point, say, without loss of generality, the first data point. The corrupted
sample is then $\mathbf{x}' = (x_1^*,x_2,\ldots,x_n)$. Denote the sample mean of the corrupted
sample by $\bar{x}^*$. Then it is easy to see that

$$\bar{x}^* - \bar{x} = \frac{1}{n}(x_1^* - x_1).$$

Hence $\mathrm{bias}(1,\mathbf{x}_n,\bar{x})$ is a linear function of $x_1^*$ and can be made as large (in absolute
value) as desired by taking $x_1^*$ large (in absolute value). Therefore, the finite sample
breakdown of the sample mean is $1/n$. Because this goes to 0 as $n \to \infty$, the
breakdown point of the sample mean is 0.

Example 10.9.3 (Breakdown Value of the Sample Median). Next consider the
sample median. Let $\mathbf{x}_n = (x_1,x_2,\ldots,x_n)$ be a realization of a random sample. If
the sample size is $n = 2k$, then it is easy to see that in a corrupted sample $\mathbf{x}_n$ when
$x_{(k)}$ tends to $-\infty$, the median also tends to $-\infty$. Hence the breakdown value of the
sample median is $k/n$, which tends to 0.5. By a similar argument, when the sample
size is $n = 2k+1$, the breakdown value is $(k+1)/n$ and it also tends to 0.5 as the
sample size increases. Hence we say that the sample median is a 50% breakdown
estimate. For a location model, 50% breakdown is the highest possible breakdown
point for an estimate. Thus the median achieves the highest possible breakdown
point. ■


In Exercise 10.9.10, the reader is asked to show that the Hodges—Lehmann esti- 
mate has the breakdown point of 0.29. 
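A small numerical experiment makes the contrast between the mean and the median concrete. This is a sketch under stated assumptions: the sample is hypothetical, and the helper corrupt replaces the first $m$ observations by one very large value, which is only one of the corruptions covered by the sup in (10.9.27).

```r
# Replace the first m observations by a huge value (one particular corruption).
corrupt <- function(x, m, big = 1e9) {
  if (m > 0) x[seq_len(m)] <- big
  x
}

x <- c(-1.3, 0.2, 0.9, -0.4, 1.7, 0.5, -2.2, 1.1, 0.0, -0.8)  # hypothetical, n = 10

# Absolute change in each estimate after corrupting m = 0, 1, ..., 5 points.
shift_mean <- sapply(0:5, function(m) abs(mean(corrupt(x, m)) - mean(x)))
shift_median <- sapply(0:5, function(m) abs(median(corrupt(x, m)) - median(x)))
```

With $n = 2k = 10$, the mean moves by an arbitrarily large amount after a single corrupted point, while the median stays within the range of the remaining data until $m = k = 5$ points are corrupted.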


10.9.2 Linear Model 


In Sections 9.6 and 10.7, respectively, we presented the least squares (LS) procedure
and a rank-based (Wilcoxon) procedure for fitting simple linear models. In this
section, we briefly compare these procedures in terms of their robustness properties.
Recall that the simple linear model is given by

$$Y_i = \alpha + \beta x_{ci} + \varepsilon_i, \quad i = 1,2,\ldots,n, \tag{10.9.30}$$

where $\varepsilon_1,\varepsilon_2,\ldots,\varepsilon_n$ are continuous random variables that are iid. In this model, we
have centered the regression variables; that is, $x_{ci} = x_i - \bar{x}$, where $x_1,x_2,\ldots,x_n$ are
considered fixed. The parameter of interest in this section is the slope parameter
$\beta$, the expected change (provided expectations exist) when the regression variable
increases by one unit. The centering of the $x$s allows us to consider the slope
parameter by itself. The results we present are invariant to the intercept parameter
$\alpha$. Estimates of $\alpha$ are discussed at the end of this section. With this in mind, define
the random variable $e_i$ to be $\varepsilon_i + \alpha$. Then we can write the model as

$$Y_i = \beta x_{ci} + e_i, \quad i = 1,2,\ldots,n, \tag{10.9.31}$$

where $e_1,e_2,\ldots,e_n$ are iid with continuous cdf $F(x)$ and pdf $f(x)$. We often refer
to the support of $Y$ as the $Y$-space. Likewise, we refer to the range of $X$ as the
$X$-space. The $X$-space is often referred to as the factor space.


Least Squares and Wilcoxon Procedures 


The first procedure is least squares (LS). The estimating equation for $\beta$ is given by
expression (9.6.4) of Chapter 9. Using the fact that $\sum_i x_{ci} = 0$, this equation can
be reexpressed as

$$\sum_{i=1}^n (Y_i - x_{ci}\beta)x_{ci} = 0. \tag{10.9.32}$$

This is the estimating equation (EE) for the LS estimator of $\beta$, which we use in
this section. It is often called the normal equation. It is easy to see that the LS
estimator is

$$\hat\beta_{LS} = \frac{\sum_{i=1}^n x_{ci}Y_i}{\sum_{i=1}^n x_{ci}^2}, \tag{10.9.33}$$

which agrees with expression (9.6.5) of Chapter 9. The geometry of the LS estimator
is discussed in Remark 9.6.2.

For our second procedure, we consider the estimate of slope discussed in Section
10.7. This is a rank-based estimate based on an arbitrary score function. In this
section, we restrict our discussion to the linear (Wilcoxon) scores; i.e., the score
function is given by $\varphi_W(u) = \sqrt{12}[u - (1/2)]$, where the subscript $W$ denotes the
Wilcoxon score function. The estimating equation of the rank-based estimator of $\beta$
is given by expression (10.7.8), which for the Wilcoxon score function is

$$\sum_{i=1}^n a_W(R(Y_i - x_{ci}\beta))\,x_{ci} = 0, \tag{10.9.34}$$

where $a_W(i) = \varphi_W[i/(n+1)]$. This equation is the analog of the LS normal equation.
See Exercise 10.9.12 for a geometric interpretation.
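Both estimating equations can be solved directly for a small data set. The sketch below is only an illustration and is not the algorithm used by the Rfit package: the data are hypothetical (true slope 2, with one outlier added in the $Y$-space), and the Wilcoxon equation (10.9.34) is solved by locating the sign change of its left side with uniroot, which relies on that left side being a nonincreasing step function of $\beta$.

```r
x <- 1:21
y <- 2 * x + sin(x)          # hypothetical data, true slope 2
y[16] <- y[16] + 1000        # a single outlier in the Y-space
xc <- x - mean(x)            # centered regression variable

# LS: solve sum (y_i - xc_i * b) * xc_i = 0, see (10.9.32) and (10.9.33).
b_ls <- sum(xc * y) / sum(xc^2)

# Wilcoxon: left side of (10.9.34) as a function of the slope b.
wilcoxon_ee <- function(b) {
  n <- length(y)
  a <- sqrt(12) * (rank(y - xc * b) / (n + 1) - 0.5)   # a_W(R(y_i - xc_i * b))
  sum(a * xc)
}

# The step function changes sign at the estimate; uniroot brackets the crossing.
b_w <- uniroot(wilcoxon_ee, interval = c(-100, 100), tol = 1e-8)$root
```

On these data the LS slope is pulled far from 2 by the single outlier, while the Wilcoxon slope stays close to 2, which anticipates the influence functions derived next.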


Influence Functions 


To determine the robustness properties of these procedures, first consider a
probability model corresponding to Model (10.9.31), in which $X$, in addition to $Y$, is
a random variable. Assume that the random vector $(X,Y)$ has joint cdf and pdf,
$H(x,y)$ and $h(x,y)$, respectively, and satisfies

$$Y = \beta X + e, \tag{10.9.35}$$

where the random variable $e$ has cdf and pdf $F(t)$ and $f(t)$, respectively, and $e$ and
$X$ are independent. Since we have centered the $x$s, we also assume that $E(X) = 0$.
As Exercise 10.9.13 shows,

$$P(Y \le t \mid X = x) = F(t - \beta x), \tag{10.9.36}$$

and, hence, $Y$ and $X$ are independent if and only if $\beta = 0$.

The functional for the LS estimator easily follows from the LS normal equation
(10.9.32). Let $H_n$ denote the empirical cdf of the pairs $(x_1,y_1),(x_2,y_2),\ldots,(x_n,y_n)$;
that is, $H_n$ is the cdf corresponding to the discrete distribution, which puts
probability (mass) of $1/n$ on each point $(x_i,y_i)$. Then the LS estimating equation, (10.9.32),
can be expressed as an expectation with respect to this distribution as

$$\sum_{i=1}^n (y_i - x_{ci}\beta)x_{ci}\frac{1}{n} = 0. \tag{10.9.37}$$

For the probability model, (10.9.35), it follows that the functional $T_{LS}(H)$
corresponding to the LS estimate is the solution to the equation

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} [y - T_{LS}(H)x]\,x\,h(x,y)\,dx\,dy = 0. \tag{10.9.38}$$

To obtain the functional corresponding to the Wilcoxon estimate, recall the
association between the ranks and the empirical cdf; see (10.5.14). For Wilcoxon
scores, we have

$$a_W(R(Y_i - x_{ci}\beta)) = \varphi_W\left[\frac{n}{n+1}F_n(Y_i - x_{ci}\beta)\right]. \tag{10.9.39}$$

Based on the Wilcoxon estimating equations, (10.9.34), and expression (10.9.39),
the functional $T_W(H)$ corresponding to the Wilcoxon estimate satisfies the equation

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \varphi_W\{F[y - T_W(H)x]\}\,x\,h(x,y)\,dx\,dy = 0. \tag{10.9.40}$$


We next derive the influence functions of the LS and Wilcoxon estimators of
$\beta$. In regression models, we are concerned about the influence of outliers in both
the $Y$- and $X$-spaces. Consider then a point-mass distribution with all its mass at
the point $(x_0,y_0)$, and let $\Delta_{(x_0,y_0)}(x,y)$ denote the corresponding cdf. Let $\epsilon$ denote
the probability of sampling from this contaminating distribution, where $0 < \epsilon < 1$.
Hence, consider the contaminated distribution with cdf

$$H_\epsilon(x,y) = (1-\epsilon)H(x,y) + \epsilon\Delta_{(x_0,y_0)}(x,y). \tag{10.9.41}$$

Because the differential is a linear operator, we have

$$dH_\epsilon(x,y) = (1-\epsilon)\,dH(x,y) + \epsilon\,d\Delta_{(x_0,y_0)}(x,y), \tag{10.9.42}$$

where $dH(x,y) = h(x,y)\,dx\,dy$; that is, $d$ corresponds to the second mixed partial
$\partial^2/\partial x\,\partial y$.
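By (10.9.38), the LS functional $T_\epsilon$ at the cdf $H_\epsilon(x,y)$ satisfies the equation

$$0 = (1-\epsilon)\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x(y - xT_\epsilon)h(x,y)\,dx\,dy + \epsilon\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x(y - xT_\epsilon)\,d\Delta_{(x_0,y_0)}(x,y). \tag{10.9.43}$$

To find the partial derivative of $T_\epsilon$ with respect to $\epsilon$, we simply implicitly
differentiate expression (10.9.43) with respect to $\epsilon$, which yields

$$\begin{aligned}
0 = {} & -\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x(y - T_\epsilon x)h(x,y)\,dx\,dy \\
& + (1-\epsilon)\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x(-x)\frac{\partial T_\epsilon}{\partial\epsilon}h(x,y)\,dx\,dy \\
& + \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x(y - xT_\epsilon)\,d\Delta_{(x_0,y_0)}(x,y) + \epsilon B,
\end{aligned} \tag{10.9.44}$$

where the expression for $B$ is not needed since we are evaluating this partial at
$\epsilon = 0$. Notice that at $\epsilon = 0$, $y - T_\epsilon x = y - Tx = y - \beta x$. Hence, at $\epsilon = 0$, the first
expression on the right side of (10.9.44) is 0, while the second expression becomes
$-E(X^2)(\partial T/\partial\epsilon)$, where the partial is evaluated at 0. Finally, the third expression
is the expected value of the point-mass distribution $\Delta_{(x_0,y_0)}$, which is, of course,
$x_0(y_0 - \beta x_0)$. Therefore, solving for the partial $\partial T_\epsilon/\partial\epsilon$ and evaluating at $\epsilon = 0$, we
see that the influence function of the LS estimator is given by

$$IF(x_0,y_0;\hat\beta_{LS}) = \frac{(y_0 - \beta x_0)x_0}{E(X^2)}. \tag{10.9.45}$$

Note that the influence function is unbounded in both the $Y$- and $X$-spaces. Hence
the LS estimator is unduly sensitive to outliers in both spaces. It is not robust.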


Based on expression (10.9.40), the Wilcoxon functional at the contaminated
distribution satisfies the equation

$$\begin{aligned}
0 = {} & (1-\epsilon)\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x\,\varphi_W[F(y - xT_\epsilon)]\,h(x,y)\,dx\,dy \\
& + \epsilon\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x\,\varphi_W[F(y - xT_\epsilon)]\,d\Delta_{(x_0,y_0)}(x,y)
\end{aligned} \tag{10.9.46}$$

[technically, the cdf $F$ should be replaced by the actual cdf of the residual, but the
result is the same; see page 477 of Hettmansperger and McKean (2011)]. Proceeding
to implicitly differentiate this expression with respect to $\epsilon$, we obtain

$$\begin{aligned}
0 = {} & -\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x\,\varphi_W[F(y - xT_\epsilon)]\,h(x,y)\,dx\,dy \\
& + (1-\epsilon)\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x\,\varphi_W'[F(y - T_\epsilon x)]\,f(y - T_\epsilon x)(-x)\frac{\partial T_\epsilon}{\partial\epsilon}\,h(x,y)\,dx\,dy \\
& + \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x\,\varphi_W[F(y - xT_\epsilon)]\,d\Delta_{(x_0,y_0)}(x,y) + \epsilon B,
\end{aligned} \tag{10.9.47}$$

where the expression for $B$ is not needed since we are evaluating this partial at
$\epsilon = 0$. When $\epsilon = 0$, then $Y - TX = e$ and the random variables $e$ and $X$ are
independent. Hence, upon setting $\epsilon = 0$, expression (10.9.47) simplifies to

$$0 = -E\{\varphi_W'[F(e)]f(e)\}\,E(X^2)\left.\frac{\partial T_\epsilon}{\partial\epsilon}\right|_{\epsilon=0} + \varphi_W[F(y_0 - x_0\beta)]\,x_0. \tag{10.9.48}$$

Since $\varphi_W'(u) = \sqrt{12}$, we finally obtain, as the influence function of the Wilcoxon
estimator,

$$IF(x_0,y_0;\hat\beta_W) = \frac{\tau\,\varphi_W[F(y_0 - \beta x_0)]\,x_0}{E(X^2)}, \tag{10.9.49}$$

where $\tau = 1/[\sqrt{12}\int f^2(e)\,de]$. Note that the influence function is bounded in the
$Y$-space, but it is unbounded in the $X$-space. Thus, unlike the LS estimator, the
Wilcoxon estimator is robust against outliers in the $Y$-space, but like the LS
estimator, it is sensitive to outliers in the $X$-space. Weighted versions of the Wilcoxon
estimator, though, have bounded influence in both the $Y$- and $X$-spaces; see the
discussion of the HBR estimator in Chapter 3 of Hettmansperger and McKean (2011).
Exercises 10.9.18 and 10.9.19 ask for derivations, respectively, of the asymptotic
distributions of the LS and Wilcoxon estimators, using their influence functions.


Breakdown Points


Breakdown for the regression model is based on the corruption of the sample in
Model (10.9.31), that is, the sample $(x_{c1},Y_1),\ldots,(x_{cn},Y_n)$. Based on the influence
functions for both the LS and Wilcoxon estimators, it is clear that corrupting one
$x_i$ breaks down both estimators. This is shown in Exercise 10.9.14. Hence the
breakdown point of each estimator is 0. The HBR estimator (weighted version of
the Wilcoxon estimator) has bounded influence in both spaces and can achieve 50%
breakdown; see Chang et al. (1999) and Hettmansperger and McKean (2011).


Intercept 


In practice, the linear model usually contains an intercept parameter; that is, the
model is given by (10.9.30) with intercept parameter $\alpha$. Notice that $\alpha$ is a location
parameter of the random variables $Y_i - \beta x_{ci}$. This suggests an estimate of location
on the residuals $Y_i - \hat\beta x_{ci}$. For LS, we take the sample mean of the residuals; i.e.,

$$\hat\alpha_{LS} = n^{-1}\sum_{i=1}^n (Y_i - \hat\beta_{LS}x_{ci}) = \bar{Y}, \tag{10.9.50}$$

because the $x_{ci}$s are centered. For the Wilcoxon fit, several choices seem appropriate.
We use the median of the Wilcoxon residuals. That is, let

$$\hat\alpha_W = \mathrm{med}_{1 \le i \le n}\{Y_i - \hat\beta_W x_{ci}\}. \tag{10.9.51}$$

For the Wilcoxon fit of the regression model, computation is discussed in Remark
10.7.1. As there, we recommend the CRAN package Rfit developed by Kloke and
McKean (2014). The R package¹ hbrfit computes the high breakdown HBR fit.


EXERCISES 


10.9.1. Consider the location model as defined in expression (10.9.1). Let
$$\hat{\theta} = \mathrm{Argmin}_{\theta}\,\|\mathbf{X} - \theta\mathbf{1}\|_{LS}^2,$$
where $\|\cdot\|_{LS}^2$ is the square of the Euclidean norm. Show that $\hat{\theta} = \overline{x}$.


10.9.2. Obtain the sensitivity curves for the sample mean, the sample median and 
the Hodges—Lehmann estimator for the following data set. Evaluate the curves at 
the values —300 to 300 in increments of 10 and graph the curves on the same plot. 
Compare the sensitivity curves. 


-9 58 12 -1 -37 0 11 21
18 24 4 53 9 9 8


Note that the R command wilcox.test(x,conf.int=T)$est computes the
Hodges-Lehmann estimate for the R vector x.


¹Downloadable at https://github.com/kloke/

10.9.3. Consider the influence function for the Hodges-Lehmann estimator given
in expression (10.9.21). Show for it that property (10.9.22) is true. Next, evaluate
expression (10.9.23) and, hence, obtain the asymptotic distribution of the estimator
as given in expression (10.9.25). Does it agree with the result derived in Section
10.3?


10.9.4. Let $F_{x,\epsilon}(t)$ be the point-mass contaminated cdf given in expression (10.9.13).
Show that
$$|F_{x,\epsilon}(t) - F_X(t)| \le \epsilon,$$
for all t.


10.9.5. Suppose X is a random variable with mean 0 and variance $\sigma^2$. Recall that
the function $F_{x,\epsilon}(t)$ is the cdf of the random variable $U = I_{1-\epsilon}X + [1 - I_{1-\epsilon}]W$,
where X, $I_{1-\epsilon}$, and W are independent random variables, X has cdf $F_X(t)$, W
has cdf $\Delta_x(t)$, and $I_{1-\epsilon}$ has a binomial(1, $1 - \epsilon$) distribution. Define the functional
$\mathrm{Var}(F_X) = \mathrm{Var}(X) = \sigma^2$. Note that the functional at the contaminated cdf $F_{x,\epsilon}(t)$
has the variance of the random variable $U = I_{1-\epsilon}X + [1 - I_{1-\epsilon}]W$. To derive the
influence function of the variance, perform the following steps:


(a) Show that $E(U) = \epsilon x$.

(b) Show that $\mathrm{Var}(U) = (1 - \epsilon)\sigma^2 + \epsilon x^2 - \epsilon^2 x^2$.

(c) Obtain the partial derivative of the right side of this equation with respect to
$\epsilon$. This is the influence function.


Hint: Because $I_{1-\epsilon}$ is a Bernoulli random variable, $I_{1-\epsilon}^2 = I_{1-\epsilon}$. Why?


10.9.6. Often influence functions are derived by differentiating implicitly the
defining equation for the functional at the contaminated cdf $F_{x,\epsilon}(t)$, (10.9.13). Consider
the mean functional with the defining equation (10.9.10). Using the linearity of the
differential, first show that the defining equation at the cdf $F_{x,\epsilon}(t)$ can be expressed
as

$$0 = \int_{-\infty}^{\infty}[t - T(F_{x,\epsilon})]\,dF_{x,\epsilon}(t) = (1 - \epsilon)\int_{-\infty}^{\infty}[t - T(F_{x,\epsilon})]f_X(t)\,dt + \epsilon\int_{-\infty}^{\infty}[t - T(F_{x,\epsilon})]\,d\Delta_x(t).$$ (10.9.52)

Recall that we want $\partial T(F_{x,\epsilon})/\partial\epsilon$. Obtain this by implicitly differentiating the above
equation with respect to $\epsilon$.


10.9.7. In Exercise 10.9.5, the influence function of the variance functional was
derived directly. Assuming that the mean of X is 0, note that the variance functional,
$V(F_X)$, also solves the equation

$$0 = \int_{-\infty}^{\infty}[t^2 - V(F_X)]f_X(t)\,dt.$$

(a) Determine the natural estimator of the variance by writing the defining
equation at the empirical cdf $F_n(t)$, for $X_1 - \overline{X}, \ldots, X_n - \overline{X}$ iid with cdf $F_X(t)$,
and solving for $V(F_n)$.


(b) As in Exercise 10.9.6, write the defining equation for the variance functional
at the contaminated cdf $F_{x,\epsilon}(t)$.


(c) Then derive the influence function by implicit differentiation of the defining
equation in part (b).

10.9.8. Show that the inverse of the cdf $F_{x,\epsilon}(t)$ given in expression (10.9.17) is
correct.


10.9.9. Let IF(x) be the influence function of the sample median given by (10.9.20).
Determine $E[IF(X)]$ and $\mathrm{Var}[IF(X)]$.

10.9.10. Let $x_1, x_2, \ldots, x_n$ be a realization of a random sample. Consider the
Hodges-Lehmann estimate of location given in expression (10.9.4). Show that the
breakdown point of this estimate is 0.29.

Hint: Suppose we corrupt m data points. We need to determine the value of m that
results in corruption of one-half of the Walsh averages. Show that the corruption
of m data points leads to

$$p(m) = m + \binom{m}{2} + m(n - m)$$

corrupted Walsh averages. Hence the finite sample breakdown point is the “correct”
solution of the quadratic equation $p(m) = n(n + 1)/4$.
10.9.11. For any $n \times 1$ vector v, define the function $\|v\|_W$ by

$$\|v\|_W = \sum_{i=1}^{n} a_W(R(v_i))\,v_i,$$ (10.9.53)

where $R(v_i)$ denotes the rank of $v_i$ among $v_1, \ldots, v_n$ and the Wilcoxon scores are
given by $a_W(i) = \varphi_W[i/(n + 1)]$ for $\varphi_W(u) = \sqrt{12}[u - (1/2)]$. By using the
correspondence between order statistics and ranks, show that

$$\|v\|_W = \sum_{i=1}^{n} a(i)\,v_{(i)},$$ (10.9.54)

where $v_{(1)} \le \cdots \le v_{(n)}$ are the ordered values of $v_1, \ldots, v_n$. Then, by establishing
the following properties, show that the function (10.9.53) is a pseudo-norm on
$R^n$.

(a) $\|v\|_W \ge 0$ and $\|v\|_W = 0$ if and only if $v_1 = v_2 = \cdots = v_n$.
Hint: First, because the scores $a(i)$ sum to 0, show that

$$\sum_{i=1}^{n} a(i)v_{(i)} = \sum_{i \le j} a(i)[v_{(i)} - v_{(j)}] + \sum_{i > j} a(i)[v_{(i)} - v_{(j)}],$$

where j is the largest integer in the set $\{1, 2, \ldots, n\}$ such that $a(j) < 0$.

(b) $\|cv\|_W = |c|\,\|v\|_W$, for all $c \in R$.


(c) $\|v + w\|_W \le \|v\|_W + \|w\|_W$, for all $v, w \in R^n$.


Hint: Determine the permutations, say, $i_k$ and $j_k$, of the integers $\{1, 2, \ldots, n\}$,
which maximize $\sum_{k=1}^{n} c_{i_k}d_{j_k}$, for the two sets of numbers $\{c_1, \ldots, c_n\}$ and
$\{d_1, \ldots, d_n\}$.


10.9.12. Remark 9.6.2 discusses the geometry of the LS estimate of $\beta$. There is an
analogous geometry for the Wilcoxon estimate. Using the norm $\|\cdot\|_W$ defined in
expression (10.9.53) of the last exercise, let

$$\hat{\beta}^* = \mathrm{Argmin}\,\|\mathbf{Y} - \mathbf{X}_c\beta\|_W,$$

where $\mathbf{Y}' = (Y_1, \ldots, Y_n)$ and $\mathbf{X}_c' = (x_{c1}, \ldots, x_{cn})$. Thus $\hat{\beta}^*$ minimizes the distance
between $\mathbf{Y}$ and the space spanned by the vector $\mathbf{X}_c$.


(a) Using expression (10.9.54), show that $\hat{\beta}^*$ satisfies the Wilcoxon estimating
equation (10.9.34). That is, $\hat{\beta}^* = \hat{\beta}_W$.


(b) Let $\hat{\mathbf{Y}}_W = \mathbf{X}_c\hat{\beta}_W$ and $\mathbf{Y} - \hat{\mathbf{Y}}_W$ denote the Wilcoxon vectors of fitted values
and residuals, respectively. Sketch a figure analogous to the LS Figure 9.6.3
but with these vectors on it. Note that your figure may not contain a right
angle.


(c) For the Wilcoxon regression procedure, determine a vector (not 0) that is
orthogonal to $\hat{\mathbf{Y}}_W$.


10.9.13. For Model (10.9.35), show that equation (10.9.36) holds. Then show that
Y and X are independent if and only if $\beta = 0$. Hence independence is based on the
value of a parameter. This is a case where normality is not necessary to have this
independence property.


10.9.14. Consider the telephone data discussed in Example 10.7.2 and given in the
rda-file telephone.rda. It is easily seen in Figure 10.7.1 that there are seven outliers
in the Y-space. Based on the estimates discussed in this example, the Wilcoxon
estimate of slope is robust to these outliers, while the LS estimate is highly sensitive
to them.


(a) For this data set, change the last value of x from 73 to 173. Notice the drastic
change in the LS fit.


(b) Obtain the Wilcoxon estimate for the changed data in part (a). Notice that
it has a drastic change also. To obtain the Wilcoxon fit, see Remark 10.7.1
on computation.


(c) Using the Wilcoxon estimates of Example 10.7.2, change the value of Y
at x = 173 to the predicted value of Y based on the Wilcoxon estimates of
Example 10.7.2. Note that this point is a “good” point at the outlying x;
that is, it fits the model. Now determine the Wilcoxon and LS estimates.
Comment on them.

10.9.15. For the pseudo-norm $\|v\|_W$ defined in expression (10.9.53), establish the
identity

$$\|v\|_W = \frac{\sqrt{3}}{2(n + 1)}\sum_{i=1}^{n}\sum_{j=1}^{n}|v_i - v_j|,$$ (10.9.55)

for all $v \in R^n$. Thus we have shown that

$$\hat{\beta}_W = \mathrm{Argmin}\sum_{i=1}^{n}\sum_{j=1}^{n}|(y_i - y_j) - \beta(x_{ci} - x_{cj})|.$$ (10.9.56)

Note that the formulation of $\hat{\beta}_W$ given in expression (10.9.56) allows an easy way to
compute the Wilcoxon estimate of slope by using an $L_1$ (least absolute deviations)
routine. Terpstra and McKean (2005) used this identity, (10.9.55), to develop R
functions for the computation of the Wilcoxon fit.


10.9.16. Suppose the random variable e has cdf F(t). Let $\varphi(u) = \sqrt{12}[u - (1/2)]$,
$0 < u < 1$, denote the Wilcoxon score function.


(a) Show that the random variable $\varphi[F(e_i)]$ has mean 0 and variance 1.


(b) Investigate the mean and variance of $\varphi[F(e_i)]$ for any score function $\varphi(u)$
which satisfies $\int_0^1 \varphi(u)\,du = 0$ and $\int_0^1 \varphi^2(u)\,du = 1$.


10.9.17. In the derivation of the influence function, we assumed that x was
random. For inference, though, we consider the case that x is given. In this case, the
variance of X, $E(X^2)$, which is found in the influence function, is replaced by its
estimate, namely, $n^{-1}\sum_{i=1}^{n} x_{ci}^2$. With this in mind, use the influence function of
the LS estimator of $\beta$ to derive the asymptotic distribution of the LS estimator;
see the discussion around expression (10.9.24). Show that it agrees with the exact
distribution of the LS estimator given in expression (9.6.9) under the assumption
that the errors have a normal distribution.


10.9.18. As in the last problem, use the influence function of the Wilcoxon
estimator of $\beta$ to derive the asymptotic distribution of the Wilcoxon estimator. For
Wilcoxon scores, show that it agrees with expression (10.7.14).


10.9.19. Use the results of the last two exercises to find the asymptotic relative
efficiency (ARE) between the Wilcoxon and LS estimators of $\beta$.




Chapter 11 


Bayesian Statistics 


11.1 Bayesian Procedures 


To understand Bayesian inference, let us review Bayes Theorem, (1.4.3), in a
situation in which we are trying to determine something about a parameter of a
distribution. Suppose we have a Poisson distribution with parameter $\theta > 0$, and we
know that the parameter is equal to either $\theta = 2$ or $\theta = 3$. In Bayesian inference, the
parameter is treated as a random variable $\Theta$. Suppose, for this example, we assign
subjective prior probabilities of $P(\Theta = 2) = 1/3$ and $P(\Theta = 3) = 2/3$ to the two
possible values. These subjective probabilities are based upon past experiences,
and it might be unrealistic that $\Theta$ can only take one of two values, instead of a
continuous $\theta > 0$ (we address this immediately after this introductory illustration).
Now suppose a random sample of size n = 2 results in the observations $x_1 = 2$,
$x_2 = 4$. Given these data, what are the posterior probabilities of $\Theta = 2$ and
$\Theta = 3$? By Bayes Theorem, we have

$$P(\Theta = 2 | X_1 = 2, X_2 = 4) = \frac{P(\Theta = 2 \text{ and } X_1 = 2, X_2 = 4)}{P(X_1 = 2, X_2 = 4 | \Theta = 2)P(\Theta = 2) + P(X_1 = 2, X_2 = 4 | \Theta = 3)P(\Theta = 3)}$$

$$= \frac{\left(\frac{1}{3}\right)\frac{e^{-2}2^2}{2!}\frac{e^{-2}2^4}{4!}}{\left(\frac{1}{3}\right)\frac{e^{-2}2^2}{2!}\frac{e^{-2}2^4}{4!} + \left(\frac{2}{3}\right)\frac{e^{-3}3^2}{2!}\frac{e^{-3}3^4}{4!}} = 0.245.$$

Similarly,

$$P(\Theta = 3 | X_1 = 2, X_2 = 4) = 1 - 0.245 = 0.755.$$
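
The two-point posterior above can be checked numerically. The following is a minimal R sketch, assuming prior probabilities of 1/3 and 2/3 on the values 2 and 3 (these are the values that reproduce the 0.245 reported in the text); the variable names are ours.

```r
# Candidate parameter values and their prior probabilities (assumed 1/3, 2/3)
theta <- c(2, 3)
prior <- c(1/3, 2/3)

# Observed sample of size n = 2
x <- c(2, 4)

# Likelihood of the whole sample at each theta (product of Poisson pmfs)
lik <- sapply(theta, function(t) prod(dpois(x, lambda = t)))

# Bayes Theorem: posterior is proportional to likelihood times prior
posterior <- lik * prior / sum(lik * prior)
print(round(posterior, 3))
```

The same three lines of arithmetic work for any finite set of candidate values; only `theta` and `prior` change.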


That is, with the observations $x_1 = 2$, $x_2 = 4$, the posterior probability of $\Theta = 2$
was smaller than the prior probability of $\Theta = 2$. Similarly, the posterior probability
of $\Theta = 3$ was greater than the corresponding prior. That is, the observations
$x_1 = 2$, $x_2 = 4$ seemed to favor $\Theta = 3$ more than $\Theta = 2$; and that seems to agree
with our intuition as $\overline{x} = 3$. Now let us address in general a more realistic situation
in which we place a prior pdf $h(\theta)$ on a support that is a continuum.

11.1.1 Prior and Posterior Distributions


We now describe the Bayesian approach to the problem of estimation. This approach
takes into account any prior knowledge of the experiment that the statistician has
and it is one application of a principle of statistical inference that may be called
Bayesian statistics. Consider a random variable X that has a distribution of
probability that depends upon the symbol $\theta$, where $\theta$ is an element of a well-defined
set $\Omega$. For example, if the symbol $\theta$ is the mean of a normal distribution, $\Omega$ may
be the real line. We have previously looked upon $\theta$ as being a parameter, albeit
an unknown parameter. Let us now introduce a random variable $\Theta$ that has a
distribution of probability over the set $\Omega$; and just as we look upon x as a possible
value of the random variable X, we now look upon $\theta$ as a possible value of the
random variable $\Theta$. Thus, the distribution of X depends upon $\theta$, an experimental
value of the random variable $\Theta$. We denote the pdf of $\Theta$ by $h(\theta)$ and we take
$h(\theta) = 0$ when $\theta$ is not an element of $\Omega$. The pdf $h(\theta)$ is called the prior pdf of $\Theta$.
Moreover, we now denote the pdf of X by $f(x|\theta)$ since we think of it as a conditional
pdf of X, given $\Theta = \theta$. For clarity in this chapter, we use the following summary
of this model:

$$X|\theta \sim f(x|\theta)$$
$$\Theta \sim h(\theta).$$ (11.1.1)

Suppose that $X_1, X_2, \ldots, X_n$ is a random sample from the conditional
distribution of X given $\Theta = \theta$ with pdf $f(x|\theta)$. Vector notation is convenient in this
chapter. Let $\mathbf{X}' = (X_1, X_2, \ldots, X_n)$ and $\mathbf{x}' = (x_1, x_2, \ldots, x_n)$. Thus we can write
the joint conditional pdf of $\mathbf{X}$, given $\Theta = \theta$, as

$$L(\mathbf{x}|\theta) = f(x_1|\theta)f(x_2|\theta)\cdots f(x_n|\theta).$$ (11.1.2)

Thus the joint pdf of $\mathbf{X}$ and $\Theta$ is

$$g(\mathbf{x}, \theta) = L(\mathbf{x}|\theta)h(\theta).$$ (11.1.3)

If $\Theta$ is a random variable of the continuous type, the joint marginal pdf of $\mathbf{X}$ is
given by

$$g_1(\mathbf{x}) = \int_{-\infty}^{\infty} g(\mathbf{x}, \theta)\,d\theta.$$ (11.1.4)

If $\Theta$ is a random variable of the discrete type, integration would be replaced by
summation. In either case, the conditional pdf of $\Theta$, given the sample $\mathbf{X}$, is

$$k(\theta|\mathbf{x}) = \frac{g(\mathbf{x}, \theta)}{g_1(\mathbf{x})} = \frac{L(\mathbf{x}|\theta)h(\theta)}{g_1(\mathbf{x})}.$$ (11.1.5)


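The distribution defined by this conditional pdf is called the posterior
distribution and (11.1.5) is called the posterior pdf. The prior distribution reflects the
subjective belief of $\Theta$ before the sample is drawn, while the posterior distribution is
the conditional distribution of $\Theta$ after the sample is drawn. Further discussion on
these distributions follows an illustrative example.

Example 11.1.1. Consider the model

$$X_i|\theta \sim \text{iid Poisson}(\theta)$$
$$\Theta \sim \Gamma(\alpha, \beta), \quad \alpha \text{ and } \beta \text{ are known}.$$

Hence the random sample is drawn from a Poisson distribution with mean $\theta$ and
the prior distribution is a $\Gamma(\alpha, \beta)$ distribution. Let $\mathbf{X}' = (X_1, X_2, \ldots, X_n)$. Thus,
in this case, the joint conditional pdf of $\mathbf{X}$, given $\Theta = \theta$, (11.1.2), is

$$L(\mathbf{x}|\theta) = \frac{\theta^{x_1}e^{-\theta}}{x_1!}\cdots\frac{\theta^{x_n}e^{-\theta}}{x_n!}.$$

Hence the joint mixed continuous-discrete pdf is given by

$$g(\mathbf{x}, \theta) = L(\mathbf{x}|\theta)h(\theta) = \left[\frac{\theta^{x_1}e^{-\theta}}{x_1!}\cdots\frac{\theta^{x_n}e^{-\theta}}{x_n!}\right]\frac{\theta^{\alpha-1}e^{-\theta/\beta}}{\Gamma(\alpha)\beta^{\alpha}},$$

provided that $x_i = 0, 1, 2, 3, \ldots$, $i = 1, 2, \ldots, n$, and $0 < \theta < \infty$, and is equal to
zero elsewhere. Then the marginal distribution of the sample, (11.1.4), is

$$g_1(\mathbf{x}) = \int_0^{\infty}\frac{\theta^{\sum x_i + \alpha - 1}e^{-(n + 1/\beta)\theta}}{x_1!\cdots x_n!\,\Gamma(\alpha)\beta^{\alpha}}\,d\theta = \frac{\Gamma\left(\sum_{i=1}^{n} x_i + \alpha\right)}{x_1!\cdots x_n!\,\Gamma(\alpha)\beta^{\alpha}(n + 1/\beta)^{\sum x_i + \alpha}}.$$ (11.1.6)

Finally, the posterior pdf of $\Theta$, given $\mathbf{X} = \mathbf{x}$, (11.1.5), is

$$k(\theta|\mathbf{x}) = \frac{L(\mathbf{x}|\theta)h(\theta)}{g_1(\mathbf{x})} = \frac{\theta^{\sum x_i + \alpha - 1}e^{-\theta/[\beta/(n\beta + 1)]}}{\Gamma\left(\sum x_i + \alpha\right)[\beta/(n\beta + 1)]^{\sum x_i + \alpha}},$$ (11.1.7)

provided that $0 < \theta < \infty$, and is equal to zero elsewhere. This conditional pdf is of
the gamma type, with parameters $\alpha^* = \sum_{i=1}^{n} x_i + \alpha$ and $\beta^* = \beta/(n\beta + 1)$. Notice
that the posterior pdf reflects both prior information $(\alpha, \beta)$ and sample information
$\left(\sum_{i=1}^{n} x_i\right)$.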
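
The update in Example 11.1.1 amounts to two lines of arithmetic. The sketch below is illustrative only: the function name `gamma_poisson_update` and the data are hypothetical, and the second gamma parameter is treated as a scale, as in the text.

```r
# Conjugate update for iid Poisson(theta) data with a Gamma(alpha, beta) prior,
# where beta is a scale parameter as in Example 11.1.1.
gamma_poisson_update <- function(x, alpha, beta) {
  n <- length(x)
  list(shape = sum(x) + alpha,          # alpha* = sum(x_i) + alpha
       scale = beta / (n * beta + 1))   # beta*  = beta / (n * beta + 1)
}

x <- c(3, 1, 4)                         # hypothetical counts, for illustration only
post <- gamma_poisson_update(x, alpha = 2, beta = 0.5)
print(post)
print(post$shape * post$scale)          # mean of the posterior gamma distribution
```

Note that R's `dgamma`, `pgamma`, and `qgamma` default to a rate parameterization, so the posterior scale must be passed as `scale =` or inverted to a rate.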


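In Example 11.1.1, notice that it is not really necessary to determine the marginal
pdf $g_1(\mathbf{x})$ to find the posterior pdf $k(\theta|\mathbf{x})$. If we divide $L(\mathbf{x}|\theta)h(\theta)$ by $g_1(\mathbf{x})$, we
must get the product of a factor that depends upon $\mathbf{x}$ but does not depend upon $\theta$,
say $c(\mathbf{x})$, and

$$\theta^{\sum x_i + \alpha - 1}e^{-\theta/[\beta/(n\beta + 1)]}.$$

That is,

$$k(\theta|\mathbf{x}) = c(\mathbf{x})\,\theta^{\sum x_i + \alpha - 1}e^{-\theta/[\beta/(n\beta + 1)]},$$

provided that $0 < \theta < \infty$ and $x_i = 0, 1, 2, \ldots$, $i = 1, 2, \ldots, n$. However, $c(\mathbf{x})$ must
be that “constant” needed to make $k(\theta|\mathbf{x})$ a pdf, namely,

$$c(\mathbf{x}) = \frac{1}{\Gamma\left(\sum x_i + \alpha\right)[\beta/(n\beta + 1)]^{\sum x_i + \alpha}}.$$

Accordingly, we frequently write that $k(\theta|\mathbf{x})$ is proportional to $L(\mathbf{x}|\theta)h(\theta)$; that is,
the posterior pdf can be written as

$$k(\theta|\mathbf{x}) \propto L(\mathbf{x}|\theta)h(\theta).$$ (11.1.8)

Note that in the right-hand member of this expression, all factors involving
constants and $\mathbf{x}$ alone (not $\theta$) can be dropped. For illustration, in solving the problem
presented in Example 11.1.1, we simply write

$$k(\theta|\mathbf{x}) \propto \theta^{\sum x_i}e^{-n\theta}\,\theta^{\alpha - 1}e^{-\theta/\beta}$$

or, equivalently,

$$k(\theta|\mathbf{x}) \propto \theta^{\sum x_i + \alpha - 1}e^{-\theta/[\beta/(n\beta + 1)]},$$

$0 < \theta < \infty$, and is equal to zero elsewhere. Clearly, $k(\theta|\mathbf{x})$ must be a gamma pdf
with parameters $\alpha^* = \sum x_i + \alpha$ and $\beta^* = \beta/(n\beta + 1)$.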
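
The proportionality argument can be illustrated numerically: normalizing the kernel by brute-force integration should reproduce the gamma pdf with parameters $\alpha^*$ and $\beta^*$. This is a sketch under assumed, hypothetical values of the data and prior parameters; it is not part of the original text.

```r
x <- c(3, 1, 4); alpha <- 2; beta <- 0.5   # hypothetical data and prior values
n <- length(x)

# Unnormalized posterior: Poisson likelihood kernel times gamma prior kernel
kernel <- function(theta) {
  theta^sum(x) * exp(-n * theta) * theta^(alpha - 1) * exp(-theta / beta)
}

# The "constant" c(x) is the reciprocal of the integral of the kernel
const <- integrate(kernel, lower = 0, upper = Inf)$value

theta_grid  <- c(0.5, 1, 2, 3.5)
normalized  <- kernel(theta_grid) / const
closed_form <- dgamma(theta_grid, shape = sum(x) + alpha,
                      scale = beta / (n * beta + 1))
print(cbind(theta_grid, normalized, closed_form))
```

The two columns should agree up to the accuracy of the numerical integration, which is the sense in which the kernel alone determines the posterior.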

There is another observation that can be made at this point. Suppose that there
exists a sufficient statistic $Y = u(\mathbf{X})$ for the parameter so that

$$L(\mathbf{x}|\theta) = g[u(\mathbf{x})|\theta]H(\mathbf{x}),$$

where now $g(y|\theta)$ is the pdf of Y, given $\Theta = \theta$. Then we note that

$$k(\theta|\mathbf{x}) \propto g[u(\mathbf{x})|\theta]h(\theta)$$

because the factor $H(\mathbf{x})$ that does not depend upon $\theta$ can be dropped. Thus, if a
sufficient statistic Y for the parameter exists, we can begin with the pdf of Y if we
wish and write

$$k(\theta|y) \propto g(y|\theta)h(\theta),$$ (11.1.9)

where now $k(\theta|y)$ is the conditional pdf of $\Theta$ given the sufficient statistic Y = y. In
the case of a sufficient statistic Y, we also use $g_1(y)$ to denote the marginal pdf of
Y; that is, in the continuous case,

$$g_1(y) = \int_{-\infty}^{\infty} g(y|\theta)h(\theta)\,d\theta.$$


11.1.2 Bayesian Point Estimation 


Suppose we want a point estimator of $\theta$. From the Bayesian viewpoint, this really
amounts to selecting a decision function $\delta$, so that $\delta(\mathbf{x})$ is a predicted value of $\theta$
(an experimental value of the random variable $\Theta$) when both the computed value $\mathbf{x}$
and the conditional pdf $k(\theta|\mathbf{x})$ are known. Now, in general, how would we predict
an experimental value of any random variable, say W, if we want our prediction to
be “reasonably close” to the value to be observed? Many statisticians would predict
the mean, E(W), of the distribution of W; others would predict a median (perhaps
unique) of the distribution of W; and some would have other predictions. However,
it seems desirable that the choice of the decision function should depend upon a loss
function $\mathcal{L}[\theta, \delta(\mathbf{x})]$. One way in which this dependence upon the loss function can
be reflected is to select the decision function $\delta$ in such a way that the conditional
expectation of the loss is a minimum. A Bayes estimate is a decision function $\delta$
that minimizes

$$E\{\mathcal{L}[\Theta, \delta(\mathbf{x})]\,|\,\mathbf{X} = \mathbf{x}\} = \int_{-\infty}^{\infty}\mathcal{L}[\theta, \delta(\mathbf{x})]k(\theta|\mathbf{x})\,d\theta$$

if $\Theta$ is a random variable of the continuous type. That is,

$$\delta(\mathbf{x}) = \mathrm{Argmin}\int_{-\infty}^{\infty}\mathcal{L}[\theta, \delta(\mathbf{x})]k(\theta|\mathbf{x})\,d\theta.$$ (11.1.10)

The associated random variable $\delta(\mathbf{X})$ is called a Bayes estimator of $\theta$. The usual
modification of the right-hand member of this equation is made for random variables
of the discrete type. If the loss function is given by $\mathcal{L}[\theta, \delta(\mathbf{x})] = [\theta - \delta(\mathbf{x})]^2$, then
the Bayes estimate is $\delta(\mathbf{x}) = E(\Theta|\mathbf{x})$, the mean of the conditional distribution of $\Theta$,
given $\mathbf{X} = \mathbf{x}$. This follows from the fact that $E[(W - b)^2]$, if it exists, is a minimum
when $b = E(W)$. If the loss function is given by $\mathcal{L}[\theta, \delta(\mathbf{x})] = |\theta - \delta(\mathbf{x})|$, then a
median of the conditional distribution of $\Theta$, given $\mathbf{X} = \mathbf{x}$, is the Bayes solution.
This follows from the fact that $E(|W - b|)$, if it exists, is a minimum when b is equal
to any median of the distribution of W.
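
To see these two facts at work, one can minimize the posterior expected loss numerically and compare the minimizers with the posterior mean and median. The sketch below assumes a hypothetical gamma posterior (shape 10, scale 0.2) and approximates the integral by a fine midpoint grid; all names are ours.

```r
# Hypothetical posterior: gamma with shape 10 and scale 0.2 (mean 2)
shape <- 10; scale <- 0.2

# Midpoint-rule approximation of the posterior on (0, 20)
h <- 0.001
theta <- seq(h / 2, 20, by = h)
w <- dgamma(theta, shape = shape, scale = scale) * h

# Posterior expected loss of the estimate d under a given loss function
expected_loss <- function(d, loss) sum(loss(theta, d) * w)

sq_loss  <- function(theta, d) (theta - d)^2
abs_loss <- function(theta, d) abs(theta - d)

# Minimize the expected loss over d for each loss function
bayes_sq  <- optimize(function(d) expected_loss(d, sq_loss),  c(0, 10))$minimum
bayes_abs <- optimize(function(d) expected_loss(d, abs_loss), c(0, 10))$minimum

print(c(minimizer = bayes_sq,  posterior_mean   = shape * scale))
print(c(minimizer = bayes_abs, posterior_median = qgamma(0.5, shape = shape, scale = scale)))
```

Because this posterior is right-skewed, the two Bayes estimates differ slightly, which illustrates how the choice of loss function changes the estimate.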

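It is easy to generalize this to estimate a specified function of $\theta$, say, $l(\theta)$. For
the loss function $\mathcal{L}[\theta, \delta(\mathbf{x})]$, a Bayes estimate of $l(\theta)$ is a decision function $\delta$ that
minimizes

$$E\{\mathcal{L}[l(\Theta), \delta(\mathbf{x})]\,|\,\mathbf{X} = \mathbf{x}\} = \int_{-\infty}^{\infty}\mathcal{L}[l(\theta), \delta(\mathbf{x})]k(\theta|\mathbf{x})\,d\theta.$$

The random variable $\delta(\mathbf{X})$ is called a Bayes estimator of $l(\theta)$.

The conditional expectation of the loss, given $\mathbf{X} = \mathbf{x}$, defines a random variable
that is a function of the sample $\mathbf{X}$. The expected value of that function of $\mathbf{X}$, in
the notation of this section, is given by

$$\int_{-\infty}^{\infty}\left\{\int_{-\infty}^{\infty}\mathcal{L}[\theta, \delta(\mathbf{x})]k(\theta|\mathbf{x})\,d\theta\right\}g_1(\mathbf{x})\,d\mathbf{x} = \int_{-\infty}^{\infty}\left\{\int_{-\infty}^{\infty}\mathcal{L}[\theta, \delta(\mathbf{x})]L(\mathbf{x}|\theta)\,d\mathbf{x}\right\}h(\theta)\,d\theta,$$

in the continuous case. The integral within the braces in the latter expression is,
for every given $\theta \in \Omega$, the risk function $R(\theta, \delta)$; accordingly, the latter expression
is the mean value of the risk, or the expected risk. Because a Bayes estimate $\delta(\mathbf{x})$
minimizes

$$\int_{-\infty}^{\infty}\mathcal{L}[\theta, \delta(\mathbf{x})]k(\theta|\mathbf{x})\,d\theta$$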


for every $\mathbf{x}$ for which $g_1(\mathbf{x}) > 0$, it is evident that a Bayes estimate $\delta(\mathbf{x})$ minimizes
this mean value of the risk. We now give two illustrative examples.

Example 11.1.2. Consider the model

$$X_i|\theta \sim \text{iid binomial}, b(1, \theta)$$
$$\Theta \sim \text{beta}(\alpha, \beta), \quad \alpha \text{ and } \beta \text{ are known};$$

that is, the prior pdf is

$$h(\theta) = \begin{cases} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha - 1}(1 - \theta)^{\beta - 1} & 0 < \theta < 1 \\ 0 & \text{elsewhere}, \end{cases}$$

where $\alpha$ and $\beta$ are assigned positive constants. We seek a decision function $\delta$ that
is a Bayes solution. The sufficient statistic is $Y = \sum_{i=1}^{n} X_i$, which has a $b(n, \theta)$
distribution. Thus the conditional pdf of Y given $\Theta = \theta$ is

$$g(y|\theta) = \begin{cases} \binom{n}{y}\theta^{y}(1 - \theta)^{n - y} & y = 0, 1, \ldots, n \\ 0 & \text{elsewhere}. \end{cases}$$

Thus, by (11.1.9), the conditional pdf of $\Theta$, given Y = y at points of positive
probability density, is

$$k(\theta|y) \propto \theta^{y}(1 - \theta)^{n - y}\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}, \quad 0 < \theta < 1.$$

That is,

$$k(\theta|y) = \frac{\Gamma(n + \alpha + \beta)}{\Gamma(\alpha + y)\Gamma(n + \beta - y)}\theta^{\alpha + y - 1}(1 - \theta)^{\beta + n - y - 1}, \quad 0 < \theta < 1,$$

and $y = 0, 1, \ldots, n$. Hence the posterior pdf is a beta density function with
parameters $(\alpha + y, \beta + n - y)$. We take the squared-error loss, i.e., $\mathcal{L}[\theta, \delta(y)] = [\theta - \delta(y)]^2$,
as the loss function. Then, the Bayesian point estimate of $\theta$ is the mean of this beta
pdf, which is

$$\delta(y) = \frac{\alpha + y}{\alpha + \beta + n}.$$

It is very instructive to note that this Bayes estimator can be written as

$$\delta(y) = \left(\frac{n}{\alpha + \beta + n}\right)\frac{y}{n} + \left(\frac{\alpha + \beta}{\alpha + \beta + n}\right)\frac{\alpha}{\alpha + \beta},$$

which is a weighted average of the maximum likelihood estimate y/n of $\theta$ and the
mean $\alpha/(\alpha + \beta)$ of the prior pdf of the parameter. Moreover, the respective weights
are $n/(\alpha + \beta + n)$ and $(\alpha + \beta)/(\alpha + \beta + n)$. Note that for large n, the Bayes estimate
is close to the maximum likelihood estimate of $\theta$ and that, furthermore, $\delta(Y)$ is a
consistent estimator of $\theta$. Thus we see that $\alpha$ and $\beta$ should be selected so that not
only is $\alpha/(\alpha + \beta)$ the desired prior mean, but the sum $\alpha + \beta$ indicates the worth of
the prior opinion relative to a sample of size n. That is, if we want our prior opinion
to have as much weight as a sample size of 20, we would take $\alpha + \beta = 20$. So if our
prior mean is 3/4, we have that $\alpha$ and $\beta$ are selected so that $\alpha = 15$ and $\beta = 5$. ■

Example 11.1.3. For this example, we have the normal model,

$$X_i|\theta \sim \text{iid } N(\theta, \sigma^2), \quad \text{where } \sigma^2 \text{ is known}$$
$$\Theta \sim N(\theta_0, \sigma_0^2), \quad \text{where } \theta_0 \text{ and } \sigma_0^2 \text{ are known}.$$

Then $Y = \overline{X}$ is a sufficient statistic. Hence an equivalent formulation of the model
is

$$Y|\theta \sim N(\theta, \sigma^2/n), \quad \text{where } \sigma^2 \text{ is known}$$
$$\Theta \sim N(\theta_0, \sigma_0^2), \quad \text{where } \theta_0 \text{ and } \sigma_0^2 \text{ are known}.$$

Then for the posterior pdf, we have

$$k(\theta|y) \propto \frac{1}{\sqrt{2\pi}\sigma/\sqrt{n}}\frac{1}{\sqrt{2\pi}\sigma_0}\exp\left[-\frac{(y - \theta)^2}{2(\sigma^2/n)} - \frac{(\theta - \theta_0)^2}{2\sigma_0^2}\right].$$

If we eliminate all constant factors (including factors involving only y), we have

$$k(\theta|y) \propto \exp\left[-\frac{[\sigma_0^2 + (\sigma^2/n)]\theta^2 - 2[y\sigma_0^2 + \theta_0(\sigma^2/n)]\theta}{2(\sigma^2/n)\sigma_0^2}\right].$$

This can be simplified by completing the square to read (after eliminating factors
not involving $\theta$)

$$k(\theta|y) \propto \exp\left[-\frac{\left(\theta - \frac{y\sigma_0^2 + \theta_0(\sigma^2/n)}{\sigma_0^2 + (\sigma^2/n)}\right)^2}{\frac{2(\sigma^2/n)\sigma_0^2}{[\sigma_0^2 + (\sigma^2/n)]}}\right].$$

That is, the posterior pdf of the parameter is obviously normal with mean

$$\frac{y\sigma_0^2 + \theta_0(\sigma^2/n)}{\sigma_0^2 + (\sigma^2/n)} = \left(\frac{\sigma_0^2}{\sigma_0^2 + (\sigma^2/n)}\right)y + \left(\frac{\sigma^2/n}{\sigma_0^2 + (\sigma^2/n)}\right)\theta_0$$ (11.1.11)

and variance $(\sigma^2/n)\sigma_0^2/[\sigma_0^2 + (\sigma^2/n)]$. If the squared-error loss function is used, this
posterior mean is the Bayes estimator. Again, note that it is a weighted average
of the maximum likelihood estimate $y = \overline{x}$ and the prior mean $\theta_0$. As in the
last example, for large n, the Bayes estimator is close to the maximum likelihood
estimator and $\delta(Y)$ is a consistent estimator of $\theta$. Thus the Bayesian procedures
permit the decision maker to enter his or her prior opinions into the solution in a
very formal way such that the influences of these prior notions are less and less as
n increases. ■
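
The weighted-average form of the Bayes estimates in Examples 11.1.2 and 11.1.3 is easy to confirm numerically. The following R sketch uses hypothetical inputs (apart from $\alpha = 15$, $\beta = 5$, the prior chosen in Example 11.1.2), and the function names are ours rather than the text's.

```r
# Example 11.1.2: beta(alpha, beta) prior, y successes in n Bernoulli trials
beta_binomial_estimate <- function(y, n, alpha, beta) {
  w <- n / (alpha + beta + n)                  # weight on the mle y/n
  c(posterior_mean = (alpha + y) / (alpha + beta + n),
    weighted_avg   = w * (y / n) + (1 - w) * alpha / (alpha + beta))
}

# Example 11.1.3: N(theta0, sigma0sq) prior, sample mean y, known variance sigmasq
normal_estimate <- function(y, n, sigmasq, theta0, sigma0sq) {
  w <- sigma0sq / (sigma0sq + sigmasq / n)     # weight on the mle y
  c(posterior_mean = (y * sigma0sq + theta0 * sigmasq / n) /
                     (sigma0sq + sigmasq / n),
    weighted_avg   = w * y + (1 - w) * theta0,
    posterior_var  = (sigmasq / n) * sigma0sq / (sigma0sq + sigmasq / n))
}

# Hypothetical data; alpha = 15, beta = 5 gives prior mean 3/4 and weight 20
bb <- beta_binomial_estimate(y = 12, n = 30, alpha = 15, beta = 5)
nn <- normal_estimate(y = 1.8, n = 25, sigmasq = 4, theta0 = 0, sigma0sq = 1)
print(bb)
print(nn)
```

In both functions the weight on the maximum likelihood estimate tends to 1 as n grows, which is the numerical counterpart of the remark that the influence of the prior diminishes with sample size.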


In Bayesian statistics, all the information is contained in the posterior pdf $k(\theta|y)$.
In Examples 11.1.2 and 11.1.3, we found Bayesian point estimates using the
squared-error loss function. It should be noted that if $\mathcal{L}[\delta(y), \theta] = |\delta(y) - \theta|$, the absolute
value of the error, then the Bayes solution would be the median of the posterior
distribution of the parameter, which is given by $k(\theta|y)$. Hence the Bayes estimator
changes, as it should, with different loss functions.

11.1.3 Bayesian Interval Estimation


If an interval estimate of $\theta$ is desired, we can find two functions $u(\mathbf{x})$ and $v(\mathbf{x})$ so
that the conditional probability

$$P[u(\mathbf{x}) < \Theta < v(\mathbf{x})\,|\,\mathbf{X} = \mathbf{x}] = \int_{u(\mathbf{x})}^{v(\mathbf{x})} k(\theta|\mathbf{x})\,d\theta$$

is large, for example, 0.95. Then the interval $u(\mathbf{x})$ to $v(\mathbf{x})$ is an interval estimate
of $\theta$ in the sense that the conditional probability of $\Theta$ belonging to that interval is
equal to 0.95. These intervals are often called credible or probability intervals,
so as not to confuse them with confidence intervals.


Example 11.1.4. Consider Example 11.1.3, where $X_1, X_2, \ldots, X_n$ is a random
sample from a $N(\theta, \sigma^2)$ distribution, where $\sigma^2$ is known, and the prior distribution
is a normal $N(\theta_0, \sigma_0^2)$ distribution. The statistic $Y = \overline{X}$ is sufficient. Recall that
the posterior pdf of $\Theta$ given Y = y was normal with mean and variance given near
expression (11.1.11). Hence a credible interval is found by taking the mean of the
posterior distribution and adding and subtracting 1.96 times its standard deviation;
that is, the interval

$$\frac{y\sigma_0^2 + \theta_0(\sigma^2/n)}{\sigma_0^2 + (\sigma^2/n)} \pm 1.96\sqrt{\frac{(\sigma^2/n)\sigma_0^2}{\sigma_0^2 + (\sigma^2/n)}}$$

forms a credible interval of probability 0.95 for $\theta$. ■
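
As a rough illustration of Example 11.1.4, the sketch below computes the credible interval for hypothetical values of y, n, $\sigma^2$, $\theta_0$, and $\sigma_0^2$, and then confirms that the posterior normal distribution assigns the interval probability 0.95. The function name and inputs are assumptions made for illustration.

```r
# Credible interval for theta under a N(theta0, sigma0sq) prior and a
# N(theta, sigmasq) model with known sigmasq; y is the sample mean.
normal_credible_interval <- function(y, n, sigmasq, theta0, sigma0sq,
                                     level = 0.95) {
  post_mean <- (y * sigma0sq + theta0 * sigmasq / n) / (sigma0sq + sigmasq / n)
  post_sd   <- sqrt((sigmasq / n) * sigma0sq / (sigma0sq + sigmasq / n))
  z <- qnorm(1 - (1 - level) / 2)          # about 1.96 when level = 0.95
  c(lower = post_mean - z * post_sd, upper = post_mean + z * post_sd,
    mean = post_mean, sd = post_sd)
}

# Hypothetical inputs
ci <- normal_credible_interval(y = 1.8, n = 25, sigmasq = 4,
                               theta0 = 0, sigma0sq = 1)
print(ci)

# Posterior probability that Theta falls in the interval
coverage <- pnorm(ci[["upper"]], ci[["mean"]], ci[["sd"]]) -
            pnorm(ci[["lower"]], ci[["mean"]], ci[["sd"]])
print(coverage)
```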


Example 11.1.5. Recall Example 11.1.1, where $\mathbf{X}' = (X_1, X_2, \ldots, X_n)$ is a random
sample from a Poisson distribution with mean $\theta$ and a $\Gamma(\alpha, \beta)$ prior, with $\alpha$ and
$\beta$ known, is considered. As given by expression (11.1.7), the posterior pdf is a
$\Gamma(y + \alpha, \beta/(n\beta + 1))$ pdf, where $y = \sum_{i=1}^{n} x_i$. Hence, if we use the squared-error
loss function, the Bayes point estimate of $\theta$ is the mean of the posterior

$$\delta(y) = \frac{\beta(y + \alpha)}{n\beta + 1} = \frac{n\beta}{n\beta + 1}\,\frac{y}{n} + \frac{\alpha\beta}{n\beta + 1}.$$

As with the other Bayes estimates we have discussed in this section, for large n
this estimate is close to the maximum likelihood estimate and the statistic $\delta(Y)$
is a consistent estimate of $\theta$. To obtain a credible interval, note that the posterior
distribution of $\frac{2(n\beta + 1)}{\beta}\Theta$ is $\chi^2(2(\sum_{i=1}^{n} x_i + \alpha))$. Based on this, the following interval
is a $(1 - \alpha)100\%$ credible interval for $\theta$ (here the $\alpha$ in $1 - \alpha$ and in the quantile
subscripts denotes the level, not the prior parameter):

$$\left(\frac{\beta}{2(n\beta + 1)}\chi^2_{1-(\alpha/2)}\left(2\left(\sum_{i=1}^{n} x_i + \alpha\right)\right),\ \frac{\beta}{2(n\beta + 1)}\chi^2_{\alpha/2}\left(2\left(\sum_{i=1}^{n} x_i + \alpha\right)\right)\right),$$ (11.1.12)

where $\chi^2_{1-(\alpha/2)}(2(\sum_{i=1}^{n} x_i + \alpha))$ and $\chi^2_{\alpha/2}(2(\sum_{i=1}^{n} x_i + \alpha))$ are the lower and upper
$\chi^2$ quantiles for a $\chi^2$ distribution with $2(\sum_{i=1}^{n} x_i + \alpha)$ degrees of freedom. ■

11.1.4 Bayesian Testing Procedures


As above, let X be a random variable with pdf (pmf) $f(x|\theta)$, $\theta \in \Omega$. Suppose we
are interested in testing the hypotheses

$$H_0: \theta \in \omega_0 \text{ versus } H_1: \theta \in \omega_1,$$

where $\omega_0 \cup \omega_1 = \Omega$ and $\omega_0 \cap \omega_1 = \phi$. A simple Bayesian procedure to test these
hypotheses proceeds as follows. Let $h(\theta)$ denote the prior distribution of the
random variable $\Theta$; let $\mathbf{X}' = (X_1, X_2, \ldots, X_n)$ denote a random sample on X; and
denote the posterior pdf or pmf by $k(\theta|\mathbf{x})$. We use the posterior distribution to
compute the following conditional probabilities:

$$P(\Theta \in \omega_0|\mathbf{x}) \text{ and } P(\Theta \in \omega_1|\mathbf{x}).$$

In the Bayesian framework, these conditional probabilities represent the truth of
$H_0$ and $H_1$, respectively. A simple rule is to

Accept $H_0$ if $P(\Theta \in \omega_0|\mathbf{x}) \ge P(\Theta \in \omega_1|\mathbf{x})$;

otherwise, accept $H_1$; that is, accept the hypothesis that has the greater conditional
probability. Note that the condition $\omega_0 \cap \omega_1 = \phi$ is required, but $\omega_0 \cup \omega_1 = \Omega$ is
not necessary. More than two hypotheses may be tested at the same time, in which
case a simple rule would be to accept the hypothesis with the greatest conditional
probability. We finish this subsection with a numerical example.


Example 11.1.6. Referring again to Example 11.1.1, where $\mathbf{X}' = (X_1, X_2, \ldots, X_n)$
is a random sample from a Poisson distribution with mean $\theta$, suppose we are
interested in testing

$$H_0: \theta \le 10 \text{ versus } H_1: \theta > 10.$$ (11.1.13)

Suppose we think $\theta$ is about 12, but we are not quite sure. Hence we choose the
$\Gamma(10, 1.2)$ pdf as our prior, which is shown in the left panel of Figure 11.1.1. The
mean of the prior is 12, but as the plot shows, there is some variability (the variance
of the prior distribution is 14.4). The data for the problem are


11 7 11 6 5 9 14 10 9 5
8 10 8 10 12 9 3 12 14 4


(these are the values of a random sample of size n = 20 taken from a Poisson
distribution with mean 8; of course, in practice we would not know the mean is 8).
The value of the sufficient statistic is $y = \sum_{i=1}^{20} x_i = 177$. Hence, from Example
11.1.1, the posterior distribution is a $\Gamma(177 + 10, 1.2/[20(1.2) + 1]) = \Gamma(187, 0.048)$
distribution, which is shown in the right panel of Figure 11.1.1. Note that the
data have moved the mean to the left of 12 to 187(0.048) = 8.976, which is the
Bayes estimate (under squared-error loss) of $\theta$. Using R, we compute the posterior
probability of $H_0$ as


$P[\Theta \le 10\,|\,y = 177] = P[\Gamma(187, 0.048) \le 10]$ = pgamma(10, 187, 1/0.048) = 0.9368.

Figure 11.1.1: Prior (left panel) and posterior (right panel) pdfs of Example 11.1.6


Thus $P[\Theta > 10\,|\,y = 177] = 1 - 0.9368 = 0.0632$; consequently, our rule would accept
$H_0$.

The 95% credible interval, (11.1.12), is (7.77, 10.31), which also contains 10; see
Exercise 11.1.7 for details. ■
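
The computations of Example 11.1.6 can be collected in a short R session. This is a sketch using only base R; the variable names are ours, and the credible interval is computed two ways (directly from gamma quantiles and through the chi-square form of expression (11.1.12)) as a consistency check rather than as the text's own code.

```r
# Data of Example 11.1.6
x <- c(11, 7, 11, 6, 5, 9, 14, 10, 9, 5,
       8, 10, 8, 10, 12, 9, 3, 12, 14, 4)
n <- length(x); y <- sum(x)             # n = 20, y = 177
alpha <- 10; beta <- 1.2                # Gamma(10, 1.2) prior, beta is a scale

# Posterior gamma parameters from Example 11.1.1
shape_post <- y + alpha                 # 187
scale_post <- beta / (n * beta + 1)     # 0.048

bayes_est <- shape_post * scale_post    # posterior mean (squared-error loss)

# Posterior probability of H0: theta <= 10, and the simple decision rule
p_H0 <- pgamma(10, shape = shape_post, scale = scale_post)
decision <- if (p_H0 >= 1 - p_H0) "accept H0" else "accept H1"

# 95% credible interval: gamma quantiles versus the chi-square form (11.1.12)
ci_gamma <- qgamma(c(0.025, 0.975), shape = shape_post, scale = scale_post)
ci_chisq <- scale_post / 2 * qchisq(c(0.025, 0.975), df = 2 * shape_post)

print(c(bayes_est = bayes_est, p_H0 = p_H0))
print(decision)
print(rbind(ci_gamma, ci_chisq))
```

The call `pgamma(10, 187, 1/0.048)` in the text relies on the third positional argument of `pgamma` being the rate; the sketch passes `scale =` by name to make the parameterization explicit.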


11.1.5 Bayesian Sequential Procedures 


Finally, we should observe what a Bayesian would do if additional data were
collected beyond $x_1, x_2, \ldots, x_n$. In such a situation, the posterior distribution found
with the observations $x_1, x_2, \ldots, x_n$ becomes the new prior distribution, additional
observations give a new posterior distribution, and inferences would be made from
that second posterior. Of course, this can continue with even more observations.
That is, the second posterior becomes the new prior, and the next set of
observations yields the next posterior from which the inferences can be made. Clearly, this
gives Bayesians an excellent way of handling sequential analysis. They can continue
taking data, always updating the previous posterior, which has become a new prior
distribution. Everything a Bayesian needs for inferences is in that final posterior
distribution obtained by this sequential procedure.
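
For the Poisson model with a gamma prior, the sequential procedure can be illustrated directly: updating on the data in two batches should give the same posterior as updating once on all of the data. The sketch below reuses the data and prior of Example 11.1.6; the helper `gp_update` is a hypothetical name, and splitting the sample into two halves is our choice for illustration.

```r
# One gamma-Poisson update step: the prior is a list with shape and scale
gp_update <- function(prior, x) {
  n <- length(x)
  list(shape = prior$shape + sum(x),
       scale = prior$scale / (n * prior$scale + 1))
}

prior <- list(shape = 10, scale = 1.2)
x <- c(11, 7, 11, 6, 5, 9, 14, 10, 9, 5,
       8, 10, 8, 10, 12, 9, 3, 12, 14, 4)

# All 20 observations at once
batch <- gp_update(prior, x)

# First 10 observations, then the posterior serves as the prior for the rest
first  <- gp_update(prior, x[1:10])
second <- gp_update(first, x[11:20])

print(unlist(batch))
print(unlist(second))
```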


EXERCISES 


11.1.1. Let Y have a binomial distribution in which n = 20 and $p = \theta$. The prior
probabilities on $\theta$ are $P(\theta = 0.3) = 2/3$ and $P(\theta = 0.5) = 1/3$. If y = 9, what are
the posterior probabilities for $\theta = 0.3$ and $\theta = 0.5$?

11.1.2. Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution that is $b(1, \theta)$.
Let the prior of $\Theta$ be a beta one with parameters $\alpha$ and $\beta$. Show that the posterior
pdf $k(\theta|x_1, x_2, \ldots, x_n)$ is exactly the same as $k(\theta|y)$ given in Example 11.1.2.


11.1.3. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution that is
$N(\theta, \sigma^2)$, where $-\infty < \theta < \infty$ and $\sigma^2$ is a given positive number. Let $Y = \overline{X}$ denote
the mean of the random sample. Take the loss function to be $\mathcal{L}[\theta, \delta(y)] = |\theta - \delta(y)|$.
If $\theta$ is an observed value of the random variable $\Theta$ that is $N(\mu, \tau^2)$, where $\tau^2 > 0$
and $\mu$ are known numbers, find the Bayes solution $\delta(y)$ for a point estimate $\theta$.


11.1.4. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a Poisson distribution
with mean $\theta$, $0 < \theta < \infty$. Let $Y = \sum_{i=1}^{n} X_i$. Use the loss function $\mathcal{L}[\theta, \delta(y)] =
[\theta - \delta(y)]^2$. Let $\theta$ be an observed value of the random variable $\Theta$. If $\Theta$ has the prior
pdf $h(\theta) = \theta^{\alpha - 1}e^{-\theta/\beta}/\Gamma(\alpha)\beta^{\alpha}$, for $0 < \theta < \infty$, zero elsewhere, where $\alpha > 0$, $\beta > 0$
are known numbers, find the Bayes solution $\delta(y)$ for a point estimate for $\theta$.


11.1.5. Let $Y_n$ be the nth order statistic of a random sample of size n from a
distribution with pdf $f(x|\theta) = 1/\theta$, $0 < x < \theta$, zero elsewhere. Take the loss
function to be $\mathcal{L}[\theta, \delta(y)] = [\theta - \delta(y_n)]^2$. Let $\theta$ be an observed value of the random
variable $\Theta$, which has the prior pdf $h(\theta) = \beta\alpha^{\beta}/\theta^{\beta + 1}$, $\alpha < \theta < \infty$, zero elsewhere,
with $\alpha > 0$, $\beta > 0$. Find the Bayes solution $\delta(y_n)$ for a point estimate of $\theta$.


11.1.6. Let $Y_1$ and $Y_2$ be statistics that have a trinomial distribution with
parameters n, $\theta_1$, and $\theta_2$. Here $\theta_1$ and $\theta_2$ are observed values of the random variables $\Theta_1$
and $\Theta_2$, which have a Dirichlet distribution with known parameters $\alpha_1$, $\alpha_2$, and
$\alpha_3$; see expression (3.3.10). Show that the conditional distribution of $\Theta_1$ and $\Theta_2$ is
Dirichlet and determine the conditional means $E(\Theta_1|y_1, y_2)$ and $E(\Theta_2|y_1, y_2)$.


11.1.7. For Example 11.1.6, obtain the 95% credible interval for $\theta$. Next obtain the
value of the mle for $\theta$ and the 95% confidence interval for $\theta$ discussed in Chapter 6.


11.1.8. In Example 11.1.2, let n = 30, $\alpha = 10$, and $\beta = 5$, so that $\delta(y) = (10 + y)/45$
is the Bayes estimate of $\theta$.


(a) If Y has a binomial distribution $b(30, \theta)$, compute the risk $E\{[\theta - \delta(Y)]^2\}$.


(b) Find values of $\theta$ for which the risk of part (a) is less than $\theta(1 - \theta)/30$, the risk
associated with the maximum likelihood estimator Y/n of $\theta$.


11.1.9. Let $Y_4$ be the largest order statistic of a sample of size n = 4 from a
distribution with uniform pdf $f(x; \theta) = 1/\theta$, $0 < x < \theta$, zero elsewhere. If the prior
pdf of the parameter is $g(\theta) = 2/\theta^3$, $1 < \theta < \infty$, zero elsewhere, find the Bayesian
estimator $\delta(Y_4)$ of $\theta$, based upon the sufficient statistic $Y_4$, using the loss function
$|\delta(y_4) - \theta|$.


11.1.10. Refer to Example 11.2.3; suppose we select oj = do”, where o? was known 
in that example. What value do we assign to d so that the variance of posterior is 
two-thirds the variance of Y = X, namely, o?/n? 
11.2 More Bayesian Terminology and Ideas


Suppose $\mathbf{X}' = (X_1, X_2, \ldots, X_n)$ represents a random sample with likelihood $L(\mathbf{x}|\theta)$
and we assume a prior pdf $h(\theta)$. The joint marginal pdf of $\mathbf{X}$ is given by

$$g(\mathbf{x}) = \int_{-\infty}^{\infty} L(\mathbf{x}|\theta)h(\theta)\,d\theta.$$

This is often called the pdf of the predictive distribution of $\mathbf{X}$ because it provides
the best description of the probabilities about $\mathbf{X}$ given the likelihood and the prior.
An illustration of this is provided in expression (11.1.6) of Example 11.1.1. Again
note that this predictive distribution is highly dependent on the probability models
for $\mathbf{X}$ and $\Theta$.

In this section, we consider two classes of prior distributions. The first class is 
the class of conjugate priors defined by: 


Definition 11.2.1. A class of prior pdfs for the family of distributions with pdfs
$f(x|\theta)$, $\theta \in \Omega$, is said to define a conjugate family of distributions if the
posterior pdf of the parameter is in the same family of distributions as the prior.


As an illustration, consider Example 11.1.5, where the pmf of $X_i$ given $\theta$ was
Poisson with mean $\theta$. In this example, we selected a gamma prior and the resulting
posterior distribution was of the gamma family also. Hence the gamma pdf forms
a conjugate class of priors for this Poisson model. This was true also for Example
11.1.2 where the conjugate family was beta and the model was a binomial, and for
Example 11.1.3, where both the model and the prior were normal.

To motivate our second class of priors, consider the binomial model, $b(1,\theta)$,
presented in Example 11.1.2. Thomas Bayes (1763) took as a prior the beta distribution
with $\alpha = \beta = 1$, namely $h(\theta) = 1$, $0 < \theta < 1$, zero elsewhere, because he
argued that he did not have much prior knowledge about $\theta$. However, we note that
this leads to the estimate of

$$\left(\frac{n}{n+2}\right)\left(\frac{y}{n}\right) + \left(\frac{2}{n+2}\right)\left(\frac{1}{2}\right).$$

We often call this a shrinkage estimate because the estimate $y/n$ is pulled a little
toward the prior mean of $1/2$, although Bayes tried to avoid having the prior
influence the inference.
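The shrinkage form can be checked numerically. The following is a minimal R sketch, assuming the beta-binomial Bayes estimate $(\alpha + y)/(\alpha + \beta + n)$ of Example 11.1.2; the function name `bayes_beta_binomial` and the data values are illustrative assumptions, not part of the text.

```r
# Bayes estimate of theta for a binomial count y out of n under a
# beta(alpha, beta) prior, written as a weighted average of the mle y/n
# and the prior mean. (Hypothetical helper, for illustration only.)
bayes_beta_binomial <- function(y, n, alpha = 1, beta = 1) {
  w <- n / (n + alpha + beta)            # weight placed on the mle y/n
  prior_mean <- alpha / (alpha + beta)   # mean of the beta prior
  w * (y / n) + (1 - w) * prior_mean     # equals (alpha + y) / (alpha + beta + n)
}

y <- 7; n <- 20                          # illustrative data, not from the text
shrunk <- bayes_beta_binomial(y, n)      # uniform prior: (y + 1) / (n + 2)
mle <- y / n
c(bayes = shrunk, mle = mle)
```

With the uniform prior the estimate lies between the mle and 1/2; letting `alpha` and `beta` shrink toward 0 moves it toward $y/n$, which is the point of the Haldane prior discussed next.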

Haldane (1948) did note, however, that if a prior beta pdf exists with $\alpha = \beta = 0$,
then the shrinkage estimate would reduce to the mle $y/n$. Of course, a beta pdf
with $\alpha = \beta = 0$ is not a pdf at all, for it would be such that

$$h(\theta) \propto \frac{1}{\theta(1-\theta)}, \quad 0 < \theta < 1,$$

zero elsewhere, and

$$\int_0^1 \frac{c}{\theta(1-\theta)}\,d\theta$$

does not exist. However, such priors are used if, when combined with the likelihood,
we obtain a posterior pdf which is a proper pdf. By proper, we mean that it
integrates to a positive constant. In this example, we obtain the posterior pdf of

$$f(\theta|y) \propto \theta^{y-1}(1-\theta)^{n-y-1},$$

which is proper provided $y > 0$ and $n - y > 0$. Of course, the posterior mean is
$y/n$.


Definition 11.2.2. Let $\mathbf{X}' = (X_1, X_2, \ldots, X_n)$ be a random sample from the distribution
with pdf $f(x|\theta)$. A prior $h(\theta) \geq 0$ for this family is said to be improper
if it is not a pdf, but the function $k(\theta|\mathbf{x}) \propto L(\mathbf{x}|\theta)h(\theta)$ can be made proper.


A noninformative prior is a prior that treats all values of $\theta$ the same, that is,
uniformly. Continuous noninformative priors are often improper. As an example,
suppose we have a normal distribution $N(\theta_1, \theta_2)$ in which both $\theta_1$ and $\theta_2 > 0$ are
unknown. A noninformative prior for $\theta_1$ is $h_1(\theta_1) = 1$, $-\infty < \theta_1 < \infty$. Clearly,
this is not a pdf. An improper prior for $\theta_2$ is $h_2(\theta_2) = c_2/\theta_2$, $0 < \theta_2 < \infty$, zero
elsewhere. Note that $\log \theta_2$ is uniformly distributed between $-\infty < \log \theta_2 < \infty$.
Hence, in this way, it is a noninformative prior. In addition, assume the parameters
are independent. Then the joint prior, which is improper, is

$$h_1(\theta_1)h_2(\theta_2) \propto 1/\theta_2, \quad -\infty < \theta_1 < \infty,\ 0 < \theta_2 < \infty. \tag{11.2.1}$$

Using this prior, we present the Bayes solution for $\theta_1$ in the next example.


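Example 11.2.1. Let $X_1, X_2, \ldots, X_n$ be a random sample from a $N(\theta_1, \theta_2)$ distribution.
Recall that $\overline{X}$ and $S^2 = (n-1)^{-1}\sum_{i=1}^{n}(X_i - \overline{X})^2$ are sufficient statistics.
Suppose we use the improper prior given by (11.2.1). Then the posterior distribution
is given by

$$k_{12}(\theta_1, \theta_2|\overline{x}, s^2) \propto \left(\frac{1}{\theta_2}\right)\left(\frac{1}{\sqrt{2\pi\theta_2}}\right)^{n} \exp\left[-\frac{1}{2}\left\{(n-1)s^2 + n(\overline{x} - \theta_1)^2\right\}/\theta_2\right]$$
$$\propto \left(\frac{1}{\theta_2}\right)^{\frac{n}{2}+1} \exp\left[-\frac{1}{2}\left\{(n-1)s^2 + n(\overline{x} - \theta_1)^2\right\}/\theta_2\right].$$

To get the conditional pdf of $\theta_1$, given $\overline{x}$ and $s^2$, we integrate out $\theta_2$:

$$k_1(\theta_1|\overline{x}, s^2) = \int_0^{\infty} k_{12}(\theta_1, \theta_2|\overline{x}, s^2)\,d\theta_2.$$

To carry this out, let us change variables $z = 1/\theta_2$ and $\theta_2 = 1/z$, with Jacobian
$-1/z^2$. Thus

$$k_1(\theta_1|\overline{x}, s^2) \propto \int_0^{\infty} \frac{z^{\frac{n}{2}+1}}{z^2} \exp\left[-\left\{\frac{(n-1)s^2 + n(\overline{x} - \theta_1)^2}{2}\right\}z\right] dz.$$

Referring to a gamma distribution with $\alpha = n/2$ and $\beta = 2/\{(n-1)s^2 + n(\overline{x} - \theta_1)^2\}$,
this result is proportional to

$$k_1(\theta_1|\overline{x}, s^2) \propto \left\{(n-1)s^2 + n(\overline{x} - \theta_1)^2\right\}^{-n/2}.$$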
Let us change variables to get more familiar results; namely, let

$$t = \frac{\theta_1 - \overline{x}}{s/\sqrt{n}} \quad \text{and} \quad \theta_1 = \overline{x} + ts/\sqrt{n},$$

with Jacobian $s/\sqrt{n}$. This conditional pdf of $t$, given $\overline{x}$ and $s^2$, is then

$$k(t|\overline{x}, s^2) \propto \left\{(n-1)s^2 + (st)^2\right\}^{-n/2} \propto \frac{1}{\left[1 + t^2/(n-1)\right]^{[(n-1)+1]/2}}.$$

That is, the conditional pdf of $t = (\theta_1 - \overline{x})/(s/\sqrt{n})$, given $\overline{x}$ and $s^2$, is a Student $t$ with
$n - 1$ degrees of freedom. Since the mean of this pdf is 0 (assuming that $n > 2$), it
follows that the Bayes estimator of $\theta_1$, under squared-error loss, is $\overline{X}$, which is also
the mle.

Of course, from $k_1(\theta_1|\overline{x}, s^2)$ or $k(t|\overline{x}, s^2)$, we can find a credible interval for $\theta_1$.
One way of doing this is to select the highest density region (HDR) of the pdf of
$\theta_1$ or that of $t$. The former is symmetric and unimodal about $\overline{x}$ and the latter
about zero, but the latter's critical values are tabulated; so we use the HDR of that
$t$-distribution. Thus, if we want an interval having probability $1 - \alpha$, we take

$$-t_{\alpha/2} < \frac{\theta_1 - \overline{x}}{s/\sqrt{n}} < t_{\alpha/2}$$

or, equivalently,

$$\overline{x} - t_{\alpha/2}\,s/\sqrt{n} < \theta_1 < \overline{x} + t_{\alpha/2}\,s/\sqrt{n}.$$

This interval is the same as the confidence interval for $\theta_1$; see Example 4.2.1. Hence,
in this case, the improper prior (11.2.1) leads to the same inference as the traditional
analysis.
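As a small numerical check of this equivalence, the R sketch below computes the HDR credible interval from the Student $t$ posterior and compares it with the usual $t$ confidence interval. The function name `hdr_t_interval` and the simulated data are assumptions made for illustration only.

```r
# HDR credible interval for theta_1 under the improper prior (11.2.1):
# xbar -/+ t_{alpha/2} * s / sqrt(n), using n - 1 degrees of freedom.
# (Hypothetical helper; the data below are simulated, not from the text.)
hdr_t_interval <- function(x, conf = 0.95) {
  n <- length(x)
  xbar <- mean(x)
  s <- sd(x)
  tcrit <- qt(1 - (1 - conf) / 2, df = n - 1)
  c(lower = xbar - tcrit * s / sqrt(n), upper = xbar + tcrit * s / sqrt(n))
}

set.seed(1)
x <- rnorm(15, mean = 10, sd = 2)   # illustrative sample
hdr_t_interval(x)                   # Bayesian HDR credible interval
t.test(x)$conf.int                  # classical 95% confidence interval
```

The two printed intervals agree, which is the numerical counterpart of the statement that the improper prior leads to the traditional inference here.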


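Example 11.2.2. Usually in a Bayesian analysis, noninformative priors are not
used if prior information exists. Let us consider the same situation as in Example
11.2.1, where the model was a $N(\theta_1, \theta_2)$ distribution. Suppose now we consider the
precision $\theta_3 = 1/\theta_2$ instead of variance $\theta_2$. The likelihood becomes

$$\left(\frac{\theta_3}{2\pi}\right)^{n/2} \exp\left[-\frac{1}{2}\left\{(n-1)s^2 + n(\overline{x} - \theta_1)^2\right\}\theta_3\right],$$

so that it is clear that a conjugate prior for $\theta_3$ is $\Gamma(\alpha, \beta)$. Further, given $\theta_3$, a
reasonable prior on $\theta_1$ is $N\left(\theta_0, \frac{1}{n_0\theta_3}\right)$, where $n_0$ is selected in some way to reflect
how many observations the prior is worth. Thus the joint prior of $\theta_1$ and $\theta_3$ is

$$h(\theta_1, \theta_3) \propto \theta_3^{\alpha-1}e^{-\theta_3/\beta}\,(n_0\theta_3)^{1/2}e^{-(\theta_1-\theta_0)^2\theta_3 n_0/2}.$$

If this is multiplied by the likelihood function, we obtain the posterior joint pdf of
$\theta_1$ and $\theta_3$, namely,

$$k(\theta_1, \theta_3|\overline{x}, s^2) \propto \theta_3^{\alpha+\frac{n}{2}+\frac{1}{2}-1} \exp\left[-\frac{1}{2}Q(\theta_1)\theta_3\right],$$

where

$$Q(\theta_1) = \frac{2}{\beta} + n_0(\theta_1 - \theta_0)^2 + \left[(n-1)s^2 + n(\overline{x} - \theta_1)^2\right]
= (n_0 + n)\left[\theta_1 - \left(\frac{n_0\theta_0 + n\overline{x}}{n_0 + n}\right)\right]^2 + D,$$

with

$$D = \frac{2}{\beta} + (n-1)s^2 + (n_0^{-1} + n^{-1})^{-1}(\theta_0 - \overline{x})^2.$$

If we integrate out $\theta_3$, we obtain

$$k_1(\theta_1|\overline{x}, s^2) \propto \int_0^{\infty} k(\theta_1, \theta_3|\overline{x}, s^2)\,d\theta_3 \propto \frac{1}{[Q(\theta_1)]^{[2\alpha+n+1]/2}}.$$

To get this in a more familiar form, change variables by letting

$$t = \frac{\theta_1 - \dfrac{n_0\theta_0 + n\overline{x}}{n_0 + n}}{\sqrt{D/[(n_0 + n)(2\alpha + n)]}},$$

with Jacobian $\sqrt{D/[(n_0 + n)(2\alpha + n)]}$. Thus

$$k_2(t|\overline{x}, s^2) \propto \frac{1}{\left[1 + \dfrac{t^2}{2\alpha+n}\right]^{(2\alpha+n+1)/2}},$$

which is a Student $t$ distribution with $2\alpha + n$ degrees of freedom. The Bayes estimate
(under squared-error loss) in this case is

$$\frac{n_0\theta_0 + n\overline{x}}{n_0 + n}.$$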


It is interesting to note that if we define "new" sample characteristics as

$$n_k = n_0 + n, \qquad \overline{x}_k = \frac{n_0\theta_0 + n\overline{x}}{n_0 + n}, \qquad s_k^2 = \frac{D}{2\alpha + n},$$

then

$$t = \frac{\theta_1 - \overline{x}_k}{s_k/\sqrt{n_k}}$$

has a $t$-distribution with $2\alpha + n$ degrees of freedom. Of course, using these degrees
of freedom, we can find $t_{\gamma/2}$ so that

$$\overline{x}_k \pm t_{\gamma/2}\frac{s_k}{\sqrt{n_k}}$$

is an HDR credible interval estimate for $\theta_1$ with probability $1 - \gamma$. Naturally, it falls
upon the Bayesian to assign appropriate values to $\alpha$, $\beta$, $n_0$, and $\theta_0$. Small values of
$\alpha$ and $n_0$ with a large value of $\beta$ would create a prior, so that this interval estimate
would differ very little from the usual one. ■
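The quantities $n_k$, $\overline{x}_k$, $s_k^2$, and the resulting interval are simple to compute. The R sketch below is one way to do so, assuming the formulas for $D$ and the $t$ posterior with $2\alpha + n$ degrees of freedom given in Example 11.2.2; the function name `normal_gamma_interval` and all numerical inputs are hypothetical.

```r
# Posterior summary for theta_1 under the conjugate prior of Example 11.2.2:
# theta_3 ~ Gamma(alpha, beta) and theta_1 | theta_3 ~ N(theta0, 1 / (n0 * theta_3)).
# (Hypothetical helper; gam is the probability left outside the interval.)
normal_gamma_interval <- function(xbar, s2, n, theta0, n0, alpha, beta, gam = 0.05) {
  nk <- n0 + n
  xbark <- (n0 * theta0 + n * xbar) / nk                 # Bayes estimate
  D <- 2 / beta + (n - 1) * s2 + (theta0 - xbar)^2 / (1 / n0 + 1 / n)
  df <- 2 * alpha + n
  sk <- sqrt(D / df)
  tcrit <- qt(1 - gam / 2, df = df)
  list(estimate = xbark, df = df,
       interval = xbark + c(-1, 1) * tcrit * sk / sqrt(nk))
}

# Illustrative inputs only (not taken from the text or its exercises).
res <- normal_gamma_interval(xbar = 10.2, s2 = 4, n = 25,
                             theta0 = 9, n0 = 2, alpha = 1, beta = 10)
res$estimate
res$interval
```

Rerunning with smaller `alpha` and `n0` and a larger `beta` gives an interval close to the usual $t$ interval, in line with the closing remark of the example.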




Finally, it should be noted that when dealing with symmetric, unimodal pos- 
terior distributions, it was extremely easy to find the HDR interval estimate. If, 
however, that posterior distribution is not symmetric, it is more difficult and often 
the Bayesian would find the interval that has equal probabilities on each tail. 


EXERCISES 


11.2.1. Let $X_1, X_2$ be a random sample from a Cauchy distribution with pdf

$$f(x;\theta_1,\theta_2) = \left(\frac{1}{\pi}\right)\frac{\theta_2}{\theta_2^2 + (x - \theta_1)^2}, \quad -\infty < x < \infty,$$

where $-\infty < \theta_1 < \infty$, $0 < \theta_2$. Use the noninformative prior $h(\theta_1, \theta_2) \propto 1$.

(a) Find the posterior pdf of $\theta_1, \theta_2$, other than the constant of proportionality.

(b) Evaluate this posterior pdf if $x_1 = 1$, $x_2 = 4$ for $\theta_1 = 1, 2, 3, 4$ and $\theta_2 =
0.5, 1.0, 1.5, 2.0$.

(c) From the 16 values in part (b), where does the maximum of the posterior pdf
seem to be?

(d) Do you know a computer program that can find the point $(\theta_1, \theta_2)$ of maximum?

11.2.2. Let $X_1, X_2, \ldots, X_{10}$ be a random sample of size $n = 10$ from a gamma
distribution with $\alpha = 3$ and $\beta = 1/\theta$. Suppose we believe that $\theta$ has a gamma
distribution with $\alpha = 10$ and $\beta = 2$.

(a) Find the posterior distribution of $\theta$.

(b) If the observed $\overline{x} = 18.2$, what is the Bayes point estimate associated with
square-error loss function?

(c) What is the Bayes point estimate using the mode of the posterior distribution?

(d) Comment on an HDR interval estimate for $\theta$. Would it be easier to find one
having equal tail probabilities?

Hint: Can the posterior distribution be related to a chi-square distribution?

11.2.3. Suppose for the situation of Example 11.2.2, $\theta_1$ has the prior distribution
$N(75, 1/(5\theta_3))$ and $\theta_3$ has the prior distribution $\Gamma(\alpha = 4, \beta = 0.5)$. Suppose the
observed sample of size $n = 50$ resulted in $\overline{x} = 77.02$ and $s^2 = 8.2$.

(a) Find the Bayes point estimate of the mean $\theta_1$.

(b) Determine an HDR interval estimate with $1 - \gamma = 0.90$.


11.2.4. Let $f(x|\theta)$, $\theta \in \Omega$, be a pdf with Fisher information, (6.2.4), $I(\theta)$. Consider
the Bayes model

$$X|\theta \sim f(x|\theta), \quad \theta \in \Omega$$
$$\Theta \sim h(\theta) \propto \sqrt{I(\theta)}. \tag{11.2.2}$$

(a) Suppose we are interested in a parameter $\tau = u(\theta)$. Use the chain rule to
prove that

$$\sqrt{I(\tau)} = \sqrt{I(\theta)}\left|\frac{\partial\theta}{\partial\tau}\right|. \tag{11.2.3}$$

(b) Show that for the Bayes model (11.2.2), the prior pdf for $\tau$ is proportional to
$\sqrt{I(\tau)}$.

The class of priors given by expression (11.2.2) is often called the class of Jeffreys'
priors; see Jeffreys (1961). This exercise shows that Jeffreys' priors exhibit an
invariance in that the prior of a parameter $\tau$, which is a function of $\theta$, is also
proportional to the square root of the information for $\tau$.

11.2.5. Consider the Bayes model

$$X_i|\theta,\ i = 1, 2, \ldots, n \sim \text{iid with distribution } \Gamma(1, \theta),\ \theta > 0$$
$$\Theta \sim h(\theta) \propto \frac{1}{\theta}.$$

(a) Show that $h(\theta)$ is in the class of Jeffreys' priors.

(b) Show that the posterior pdf is

$$h(\theta|y) \propto \left(\frac{1}{\theta}\right)^{n+2-1} e^{-y/\theta},$$

where $y = \sum_{i=1}^{n} x_i$.

(c) Show that if $\tau = \theta^{-1}$, then the posterior $k(\tau|y)$ is the pdf of a $\Gamma(n, 1/y)$
distribution.

(d) Determine the posterior pdf of $2y\tau$. Use it to obtain a $(1 - \alpha)100\%$ credible
interval for $\theta$.

(e) Use the posterior pdf in part (d) to determine a Bayesian test for the hypotheses
$H_0: \theta \geq \theta_0$ versus $H_1: \theta < \theta_0$, where $\theta_0$ is specified.

11.2.6. Consider the Bayes model

$$X_i|\theta,\ i = 1, 2, \ldots, n \sim \text{iid with distribution Poisson}(\theta),\ \theta > 0$$
$$\Theta \sim h(\theta) \propto \theta^{-1/2}.$$

(a) Show that $h(\theta)$ is in the class of Jeffreys' priors.

(b) Show that the posterior pdf of $2n\theta$ is the pdf of a $\chi^2(2y + 1)$ distribution,
where $y = \sum_{i=1}^{n} x_i$.

(c) Use the posterior pdf of part (b) to obtain a $(1 - \alpha)100\%$ credible interval for
$\theta$.

(d) Use the posterior pdf in part (b) to determine a Bayesian test for the hypotheses
$H_0: \theta \geq \theta_0$ versus $H_1: \theta < \theta_0$, where $\theta_0$ is specified.


11.2.7. Consider the Bayes model

$$X_i|\theta,\ i = 1, 2, \ldots, n \sim \text{iid with distribution } b(1, \theta),\ 0 < \theta < 1.$$

(a) Obtain the Jeffreys' prior for this model.

(b) Assume squared-error loss and obtain the Bayes estimate of $\theta$.

11.2.8. Consider the Bayes model

$$X_i|\theta,\ i = 1, 2, \ldots, n \sim \text{iid with distribution } b(1, \theta),\ 0 < \theta < 1$$
$$\Theta \sim h(\theta) = 1.$$

(a) Obtain the posterior pdf.

(b) Assume squared-error loss and obtain the Bayes estimate of $\theta$.

11.2.9. Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be a random sample from a multivariate normal
distribution with mean vector $\boldsymbol{\mu} = (\mu_1, \mu_2, \ldots, \mu_k)'$ and known positive definite
covariance matrix $\boldsymbol{\Sigma}$. Let $\overline{\mathbf{X}}$ be the mean vector of the random sample. Suppose
that $\boldsymbol{\mu}$ has a prior multivariate normal distribution with mean $\boldsymbol{\mu}_0$ and positive
definite covariance matrix $\boldsymbol{\Sigma}_0$. Find the posterior distribution of $\boldsymbol{\mu}$, given $\overline{\mathbf{X}} = \overline{\mathbf{x}}$.
Then find the Bayes estimate $E(\boldsymbol{\mu}|\overline{\mathbf{X}} = \overline{\mathbf{x}})$.


11.3 Gibbs Sampler


From the preceding sections, it is clear that integration techniques play a significant 
role in Bayesian inference. Hence, we now touch on some of the Monte Carlo 
techniques used for integration in Bayesian inference. 

The Monte Carlo techniques discussed in Chapter 5 can often be used to obtain
Bayesian estimates. For example, suppose a random sample is drawn from a
$N(\theta, \sigma^2)$, where $\sigma^2$ is known. Then $Y = \overline{X}$ is a sufficient statistic. Consider the
Bayes model

$$Y|\theta \sim N(\theta, \sigma^2/n)$$
$$\Theta \sim h(\theta) \propto b^{-1}\exp\{-(\theta - a)/b\}/(1 + \exp\{-[(\theta - a)/b]\})^2, \quad -\infty < \theta < \infty,$$
$$a \text{ and } b > 0 \text{ are known}, \tag{11.3.1}$$


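i.e., the prior is a logistic distribution. Thus the posterior pdf is

$$k(\theta|y) = \frac{\dfrac{1}{\sqrt{2\pi}\,\sigma/\sqrt{n}}\exp\left\{-\dfrac{1}{2}\dfrac{(y-\theta)^2}{\sigma^2/n}\right\} b^{-1}e^{-(\theta-a)/b}/(1 + e^{-[(\theta-a)/b]})^2}{\displaystyle\int_{-\infty}^{\infty}\dfrac{1}{\sqrt{2\pi}\,\sigma/\sqrt{n}}\exp\left\{-\dfrac{1}{2}\dfrac{(y-\theta)^2}{\sigma^2/n}\right\} b^{-1}e^{-(\theta-a)/b}/(1 + e^{-[(\theta-a)/b]})^2\,d\theta}.$$

Assuming squared-error loss, the Bayes estimate is the mean of this posterior distribution.
Its computation involves two integrals, which cannot be obtained in closed
form. We can, however, think of the integration in the following way. Consider the
likelihood $f(y|\theta)$ as a function of $\theta$; that is, consider the function

$$w(\theta) = f(y|\theta) = \frac{1}{\sqrt{2\pi}\,\sigma/\sqrt{n}}\exp\left\{-\frac{1}{2}\frac{(y-\theta)^2}{\sigma^2/n}\right\}.$$

We can then write the Bayes estimate as

$$\delta(y) = \frac{\displaystyle\int_{-\infty}^{\infty}\theta\,w(\theta)\,b^{-1}e^{-(\theta-a)/b}/(1 + e^{-[(\theta-a)/b]})^2\,d\theta}{\displaystyle\int_{-\infty}^{\infty}w(\theta)\,b^{-1}e^{-(\theta-a)/b}/(1 + e^{-[(\theta-a)/b]})^2\,d\theta} = \frac{E[\Theta w(\Theta)]}{E[w(\Theta)]}, \tag{11.3.2}$$

where the expectation is taken with $\Theta$ having the logistic prior distribution.

The estimation can be carried out by simple Monte Carlo. Independently, generate
$\Theta_1, \Theta_2, \ldots, \Theta_m$ from the logistic distribution with pdf as in (11.3.1). This
generation is easily computed because the inverse of the logistic cdf is given by
$a + b\log\{u/(1-u)\}$, for $0 < u < 1$. Then form the random variable,

$$T_m = \frac{m^{-1}\sum_{i=1}^{m}\Theta_i w(\Theta_i)}{m^{-1}\sum_{i=1}^{m}w(\Theta_i)}. \tag{11.3.3}$$

By the Weak Law of Large Numbers (Theorem 5.1.1) and Slutsky's Theorem (Theorem
5.2.4), $T_m \to \delta(y)$, in probability. The value of $m$ can be quite large. Thus
simple Monte Carlo techniques enable us to compute this Bayes estimate. Note that
we can bootstrap this sample to obtain a confidence interval for $E[\Theta w(\Theta)]/E[w(\Theta)]$;
see Exercise 11.3.2.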
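The simple Monte Carlo estimator $T_m$ of (11.3.3) can be sketched in a few lines of R. This is a minimal illustration, assuming the logistic prior of (11.3.1) and its inverse cdf $a + b\log\{u/(1-u)\}$; the function name `bayes_logistic_mc` and the values of $y$, $n$, $\sigma$, $a$, and $b$ are hypothetical choices, not values from the text.

```r
# Simple Monte Carlo estimate T_m of the Bayes estimate delta(y):
# draw Theta_i from the logistic prior by inverting its cdf, weight each
# draw by the likelihood w(theta) = f(y | theta), and form the ratio (11.3.3).
bayes_logistic_mc <- function(y, n, sigma, a, b, m = 10000) {
  u <- runif(m)
  theta <- a + b * log(u / (1 - u))                    # inverse logistic cdf
  w <- dnorm(y, mean = theta, sd = sigma / sqrt(n))    # w(theta_i)
  sum(theta * w) / sum(w)                              # T_m
}

set.seed(123)
# Illustrative inputs only: observed ybar = 1.2 from n = 20 with sigma = 2.
est <- bayes_logistic_mc(y = 1.2, n = 20, sigma = 2, a = 0, b = 1, m = 50000)
est
```

Because the prior draws are reweighted by the likelihood, the estimate is pulled from the prior location $a$ toward the observed $y$; increasing `m` reduces the Monte Carlo error.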

Besides simple Monte Carlo methods, there are other more complicated Monte
Carlo procedures that are useful in Bayesian inference. For motivation, consider
the case in which we want to generate an observation that has pdf $f_X(x)$, but this
generation is somewhat difficult. Suppose, however, that it is easy to generate both
$Y$, with pdf $f_Y(y)$, and an observation from the conditional pdf $f_{X|Y}(x|y)$. As the
following theorem shows, if we do these sequentially, then we can easily generate
from $f_X(x)$.
Theorem 11.3.1. Suppose we generate random variables by the following algorithm:

1. Generate $Y \sim f_Y(y)$,
2. Generate $X \sim f_{X|Y}(x|Y)$.

Then $X$ has pdf $f_X(x)$.

Proof: To avoid confusion, let $T$ be the random variable generated by the algorithm.
We need to show that $T$ has pdf $f_X(x)$. Probabilities of events concerning $T$ are
conditional on $Y$ and are taken with respect to the cdf $F_{X|Y}$. Recall that probabilities
can always be written as expectations of indicator functions and, hence, for
events concerning $T$, are conditional expectations. In particular, for any $t \in R$,

$$P[T \leq t] = E[F_{X|Y}(t)]$$
$$= \int_{-\infty}^{\infty}\left[\int_{-\infty}^{t} f_{X|Y}(x|y)\,dx\right] f_Y(y)\,dy$$
$$= \int_{-\infty}^{t}\left[\int_{-\infty}^{\infty} f_{X|Y}(x|y)f_Y(y)\,dy\right] dx$$
$$= \int_{-\infty}^{t}\left[\int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy\right] dx$$
$$= \int_{-\infty}^{t} f_X(x)\,dx.$$

Hence the random variable generated by the algorithm has pdf $f_X(x)$, as was to be
shown.


In the situation of this theorem, suppose we want to determine $E[W(X)]$, for
some function $W(x)$, where $E[W^2(X)] < \infty$. Using the algorithm of the theorem,
generate independently the sequence $(Y_1, X_1), (Y_2, X_2), \ldots, (Y_m, X_m)$, for a specified
value of $m$, where $Y_i$ is drawn from the pdf $f_Y(y)$ and $X_i$ is generated from the pdf
$f_{X|Y}(x|Y_i)$. Then by the Weak Law of Large Numbers,

$$\overline{W} = \frac{1}{m}\sum_{i=1}^{m}W(X_i) \xrightarrow{P} \int_{-\infty}^{\infty}W(x)f_X(x)\,dx = E[W(X)].$$

Furthermore, by the Central Limit Theorem, $\sqrt{m}(\overline{W} - E[W(X)])$ converges in
distribution to a $N(0, \sigma_W^2)$ distribution, where $\sigma_W^2 = \text{Var}(W(X))$. If $w_1, w_2, \ldots, w_m$
is a realization of such a random sample, then an approximate $(1 - \alpha)100\%$ (large
sample) confidence interval for $E[W(X)]$ is

$$\overline{w} \pm z_{\alpha/2}\frac{s_W}{\sqrt{m}}, \tag{11.3.4}$$

where $s_W^2 = (m-1)^{-1}\sum_{i=1}^{m}(w_i - \overline{w})^2$.
To set ideas, we present the following simple example.
Example 11.3.1. Suppose the random variable $X$ has pdf

$$f_X(x) = \begin{cases} 2e^{-x}(1 - e^{-x}) & 0 < x < \infty \\ 0 & \text{elsewhere.} \end{cases} \tag{11.3.5}$$

Suppose $Y$ and $X|Y$ have the respective pdfs

$$f_Y(y) = \begin{cases} 2e^{-2y} & 0 < y < \infty \\ 0 & \text{elsewhere} \end{cases} \tag{11.3.6}$$

$$f_{X|Y}(x|y) = \begin{cases} e^{-(x-y)} & y < x < \infty \\ 0 & \text{elsewhere.} \end{cases} \tag{11.3.7}$$

Suppose we generate random variables by the following algorithm:

1. Generate $Y \sim f_Y(y)$ as in expression (11.3.6).
2. Generate $X \sim f_{X|Y}(x|Y)$ as in expression (11.3.7).

Then, by Theorem 11.3.1, $X$ has the pdf (11.3.5). Furthermore, it is easy to generate
from the pdfs (11.3.6) and (11.3.7) because the inverses of the respective cdfs are
given by $F_Y^{-1}(u) = -2^{-1}\log(1 - u)$ and $F_{X|Y}^{-1}(u) = -\log(1 - u) + Y$.

As a numerical illustration, the R function condsim1 (found at the site listed
in the Preface) uses this algorithm to generate observations from the pdf (11.3.5).
Using this function, we performed $m = 10{,}000$ simulations of the algorithm. The
sample mean and standard deviation were $\overline{x} = 1.495$ and $s = 1.112$. Hence a 95%
confidence interval for $E(X)$ is (1.473, 1.517), which traps the true value $E(X) =
1.5$; see Exercise 11.3.4. ■
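For readers who want to reproduce a simulation of this kind, the following R sketch implements the two-step algorithm with the inverse cdfs given above and forms the interval (11.3.4). It is not the authors' condsim1 function; the name `condsim_sketch` and the seed are assumptions, so the numbers will differ somewhat from those reported.

```r
# Two-step generation of Theorem 11.3.1 for Example 11.3.1:
#   Y from (11.3.6) via F_Y^{-1}(u) = -(1/2) log(1 - u),
#   X | Y from (11.3.7) via F_{X|Y}^{-1}(u) = -log(1 - u) + Y.
condsim_sketch <- function(m) {
  y <- -0.5 * log(1 - runif(m))
  x <- -log(1 - runif(m)) + y
  x
}

set.seed(20)
x <- condsim_sketch(10000)
xbar <- mean(x)
s <- sd(x)
# Large-sample 95% confidence interval for E(X), as in (11.3.4).
ci <- xbar + c(-1, 1) * qnorm(0.975) * s / sqrt(length(x))
c(mean = xbar, sd = s, lower = ci[1], upper = ci[2])
```

The sample mean should be close to the true value $E(X) = 1.5$, and the interval has half-width of roughly 0.02 when $m = 10{,}000$.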


For the last example, Exercise 11.3.3 establishes the joint distribution of (X,Y) 
and shows that the marginal pdf of X is given by (11.3.5). Furthermore, as shown in 
this exercise, it is easy to generate from the distribution of X directly. In Bayesian 
inference, though, we are often dealing with conditional pdfs, and theorems such as 
Theorem 11.3.1 are quite useful. 

The main purpose of presenting this algorithm is to motivate another algorithm, 
called the Gibbs Sampler, which is useful in Bayes methodology. We describe it 
in terms of two random variables. Suppose (X,Y) has pdf f(x,y). Our goal is to 
generate two streams of iid random variables, one on X and the other on Y. The 
Gibbs sampler algorithm is: 


Algorithm 11.3.1 (Gibbs Sampler). Let $m$ be a positive integer, and let $X_0$, an
initial value, be given. Then for $i = 1, 2, 3, \ldots, m$,

1. Generate $Y_i|X_{i-1} \sim f(y|x)$.
2. Generate $X_i|Y_i \sim f(x|y)$.

Note that before entering the $i$th step of the algorithm, we have generated $X_{i-1}$.
Let $x_{i-1}$ denote the observed value of $X_{i-1}$. Then, using this value, generate
sequentially the new $Y_i$ from the pdf $f(y|x_{i-1})$ and then draw (the new) $X_i$ from the
pdf $f(x|y_i)$, where $y_i$ is the observed value of $Y_i$. In advanced texts, it is shown that

$$Y_i \xrightarrow{D} Y \sim f_Y(y)$$
$$X_i \xrightarrow{D} X \sim f_X(x), \tag{11.3.8}$$

as $i \to \infty$, and

$$\frac{1}{m}\sum_{i=1}^{m}W(X_i) \xrightarrow{P} E[W(X)], \quad \text{as } m \to \infty. \tag{11.3.9}$$


Note that the Gibbs sampler is similar to, but not quite the same as, the algorithm
given by Theorem 11.3.1. Consider the sequence of generated pairs

$$(X_1, Y_1), (X_2, Y_2), \ldots, (X_k, Y_k), (X_{k+1}, Y_{k+1}).$$

Note that to compute $(X_{k+1}, Y_{k+1})$, we need only the pair $(X_k, Y_k)$ and none of the
previous pairs from 1 to $k - 1$. That is, given the present state of the sequence, the
future of the sequence is independent of the past. In stochastic processes such a
sequence is called a Markov chain. Under general conditions, the distribution of
Markov chains stabilizes (reaches an equilibrium or steady-state distribution) as the
length of the chain increases. For the Gibbs sampler, the equilibrium distributions
are the limiting distributions in the expression (11.3.8) as $i \to \infty$. How large should
$i$ be? In practice, usually the chain is allowed to run to some large value $i$ before
recording the observations. Furthermore, several recordings are run with this value
of $i$ and the resulting empirical distributions of the generated random observations
are examined for their similarity. Also, the starting value for $X_0$ is needed; see
Casella and George (1992) for a discussion. The theory behind the convergences
given in the expression (11.3.8) is beyond the scope of this text. There are many
excellent references on this theory. A discussion from an elementary level can be
found in Casella and George (1992). An informative overview can be found in
Chapter 7 of Robert and Casella (1999); see also Lehmann and Casella (1998). We
next provide a simple example.


Example 11.3.2. Suppose $(X, Y)$ has the mixed discrete-continuous pdf given by

$$f(x, y) = \begin{cases} \dfrac{1}{\Gamma(\alpha)}\dfrac{1}{x!}\,y^{\alpha+x-1}e^{-2y} & y > 0;\ x = 0, 1, 2, \ldots \\ 0 & \text{elsewhere,} \end{cases} \tag{11.3.10}$$

for $\alpha > 0$. Exercise 11.3.5 shows that this is a pdf and obtains the marginal pdfs.
The conditional pdfs, however, are given by

$$f(y|x) \propto y^{\alpha+x-1}e^{-2y} \tag{11.3.11}$$

and

$$f(x|y) \propto e^{-y}\frac{y^x}{x!}. \tag{11.3.12}$$

Hence the conditional densities are $\Gamma(\alpha + x, 1/2)$ and Poisson$(y)$, respectively. Thus
the Gibbs sampler algorithm is, for $i = 1, 2, \ldots, m$,

1. Generate $Y_i|X_{i-1} \sim \Gamma(\alpha + X_{i-1}, 1/2)$.
2. Generate $X_i|Y_i \sim \text{Poisson}(Y_i)$.
In particular, for large $m$ and $n > m$,

$$\overline{Y} = (n - m)^{-1}\sum_{i=m+1}^{n}Y_i \xrightarrow{P} E(Y) \tag{11.3.13}$$

$$\overline{X} = (n - m)^{-1}\sum_{i=m+1}^{n}X_i \xrightarrow{P} E(X). \tag{11.3.14}$$

In this case, it can be shown (see Exercise 11.3.5) that both expectations are equal
to $\alpha$. The R function gibbser2.s, found at the site listed in the Preface, computes
this Gibbs sampler. Using this routine, the authors obtained the following results
upon setting $\alpha = 10$, $m = 3000$, and $n = 6000$:

Parameter            | Estimate       | Sample Estimate | Sample Variance | Approximate 95% Confidence Interval
$E(Y) = \alpha = 10$ | $\overline{y}$ | 10.027          | 10.775          | (9.910, 10.145)
$E(X) = \alpha = 10$ | $\overline{x}$ | 10.061          |                 | (9.896, 10.225)

where the estimates $\overline{y}$ and $\overline{x}$ are the observed values of the estimators in expressions
(11.3.13) and (11.3.14), respectively. The confidence intervals for $\alpha$ are the large
sample confidence intervals for means discussed in Example 4.2.2, using the sample
variances found in the fourth column of the above table. Note that both confidence
intervals trapped $\alpha = 10$.
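The two-step sampler of this example is short enough to sketch directly in R. The code below is not the authors' gibbs­er2.s routine; it is a minimal illustration, assuming the text's convention that $\Gamma(\alpha, \beta)$ has scale $\beta$, with a hypothetical function name `gibbs_gamma_poisson`, an arbitrary starting value `x0`, and an arbitrary seed.

```r
# Gibbs sampler for Example 11.3.2:
#   Y_i | X_{i-1} ~ Gamma(shape = alpha + X_{i-1}, scale = 1/2)
#   X_i | Y_i     ~ Poisson(Y_i)
# The first m iterations are discarded; iterations m + 1, ..., n are averaged.
gibbs_gamma_poisson <- function(alpha, m, n, x0 = 1) {
  x <- numeric(n)
  y <- numeric(n)
  xprev <- x0                                   # starting value X_0 (assumed)
  for (i in seq_len(n)) {
    y[i] <- rgamma(1, shape = alpha + xprev, scale = 1 / 2)
    x[i] <- rpois(1, lambda = y[i])
    xprev <- x[i]
  }
  keep <- (m + 1):n
  list(ybar = mean(y[keep]), xbar = mean(x[keep]),
       yvar = var(y[keep]), xvar = var(x[keep]))
}

set.seed(11)
out <- gibbs_gamma_poisson(alpha = 10, m = 3000, n = 6000)
out$ybar; out$xbar    # both should be near alpha = 10
```

Since successive draws of the chain are dependent, intervals built from the sample variances as if the draws were iid should be read as approximate.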


EXERCISES 


11.3.1. Suppose $Y$ has a $\Gamma(1, 1)$ distribution while $X$ given $Y$ has the conditional
pdf

$$f(x|y) = \begin{cases} e^{-(x-y)} & 0 < y < x < \infty \\ 0 & \text{elsewhere.} \end{cases}$$

Note that both the pdf of $Y$ and the conditional pdf are easy to simulate.

(a) Set up the algorithm of Theorem 11.3.1 to generate a stream of iid observations
with pdf $f_X(x)$.

(b) State how to estimate $E(X)$.

(c) Using your algorithm found in part (a), write an R function to estimate $E(X)$.

(d) Using your program, obtain a stream of 2000 simulations. Compute your
estimate of $E(X)$ and find an approximate 95% confidence interval.

(e) Show that $X$ has a $\Gamma(2, 1)$ distribution. Did your confidence interval trap the
true value 2?

11.3.2. Carefully write down the algorithm to obtain a bootstrap percentile confidence
interval for $E[\Theta w(\Theta)]/E[w(\Theta)]$, using the sample $\Theta_1, \Theta_2, \ldots, \Theta_m$ and the
estimator given in expression (11.3.3). Write R code for this bootstrap.
11.3.3. Consider Example 11.3.1.

(a) Show that $E(X) = 1.5$.

(b) Obtain the inverse of the cdf of $X$ and use it to show how to generate $X$
directly.

11.3.4. Obtain another 10,000 simulations similar to those discussed at the end of
Example 11.3.1. Use your simulations to obtain a confidence interval for $E(X)$.

11.3.5. Consider Example 11.3.2.

(a) Show that the function given in expression (11.3.10) is a joint, mixed discrete-continuous
pdf.

(b) Show that the random variable $Y$ has a $\Gamma(\alpha, 1)$ distribution.

(c) Show that the random variable $X$ has a negative binomial distribution with
pmf

$$p(x) = \begin{cases} \dfrac{(\alpha + x - 1)!}{x!(\alpha - 1)!}\,2^{-(\alpha+x)} & x = 0, 1, 2, \ldots \\ 0 & \text{elsewhere.} \end{cases}$$

(d) Show that $E(X) = \alpha$.

11.3.6. Write an R function (or use gibbser2.s) for the Gibbs sampler discussed in
Example 11.3.2. Run your function for $\alpha = 10$, $m = 3000$, and $n = 6000$. Compare
your results with those of the authors tabled in the example.

11.3.7. Consider the following mixed discrete-continuous pdf for a random vector
$(X, Y)$ (discussed in Casella and George, 1992):

$$f(x, y) \propto \begin{cases} \dbinom{n}{x}y^{x+\alpha-1}(1 - y)^{n-x+\beta-1} & x = 0, 1, \ldots, n,\ 0 < y < 1 \\ 0 & \text{elsewhere,} \end{cases}$$

for $\alpha > 0$ and $\beta > 0$.

(a) Show that this function is indeed a joint, mixed discrete-continuous pdf by
finding the proper constant of proportionality.

(b) Determine the conditional pdfs $f(x|y)$ and $f(y|x)$.

(c) Write the Gibbs sampler algorithm to generate random samples on $X$ and $Y$.

(d) Determine the marginal distributions of $X$ and $Y$.

11.3.8. Write an R function for the Gibbs sampler of Exercise 11.3.7. Run your
program for $\alpha = 10$, $\beta = 4$, $m = 3000$, and $n = 6000$. Obtain estimates (and confidence
intervals) of $E(X)$ and $E(Y)$ and compare them with the true parameters.
11.4 Modern Bayesian Methods


The prior pdf has an important influence in Bayesian inference. We need only
consider the different Bayes estimators for the normal model based on different
priors, as shown in Examples 11.1.3 and 11.2.1. One way of having more control
over the prior is to model the prior in terms of another random variable. This is
called the hierarchical Bayes model, and it is of the form

$$X|\theta \sim f(x|\theta)$$
$$\Theta|\gamma \sim h(\theta|\gamma)$$
$$\Gamma \sim \psi(\gamma). \tag{11.4.1}$$

With this model we can exert control over the prior $h(\theta|\gamma)$ by modifying the pdf
of the random variable $\Gamma$. A second methodology, empirical Bayes, obtains an
estimate of $\gamma$ and plugs it into the posterior pdf. We offer the reader a brief introduction
of these procedures in this section. There are several good books on Bayesian
methods. In particular, Chapter 4 of Lehmann and Casella (1998) discusses these
procedures in some detail.

Consider first the hierarchical Bayes model given by (11.4.1). The parameter $\gamma$
can be thought of as a nuisance parameter. It is often called a hyperparameter. As
with regular Bayes, the inference focuses on the parameter $\theta$; hence, the posterior
pdf of interest remains the conditional pdf $k(\theta|\mathbf{x})$.

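These discussions often involve several pdfs; hence, we frequently use $g$ as a
generic pdf. It will always be clear from its arguments what distribution it represents.
Keep in mind that the conditional pdf $f(\mathbf{x}|\theta)$ does not depend on $\gamma$; hence,

$$g(\theta, \gamma|\mathbf{x}) = \frac{g(\mathbf{x}, \theta, \gamma)}{g(\mathbf{x})} = \frac{g(\mathbf{x}|\theta, \gamma)g(\theta, \gamma)}{g(\mathbf{x})} = \frac{f(\mathbf{x}|\theta)h(\theta|\gamma)\psi(\gamma)}{g(\mathbf{x})}.$$

Therefore, the posterior pdf is given by

$$k(\theta|\mathbf{x}) = \frac{\int_{-\infty}^{\infty}f(\mathbf{x}|\theta)h(\theta|\gamma)\psi(\gamma)\,d\gamma}{\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}f(\mathbf{x}|\theta)h(\theta|\gamma)\psi(\gamma)\,d\gamma\,d\theta}. \tag{11.4.2}$$

Furthermore, assuming squared-error loss, the Bayes estimate of $W(\theta)$ is

$$\delta_W(\mathbf{x}) = \frac{\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}W(\theta)f(\mathbf{x}|\theta)h(\theta|\gamma)\psi(\gamma)\,d\gamma\,d\theta}{\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}f(\mathbf{x}|\theta)h(\theta|\gamma)\psi(\gamma)\,d\gamma\,d\theta}.$$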

Recall that we defined the Gibbs sampler in Section 11.3. Here we describe it
to obtain the Bayes estimate of $W(\theta)$. For $i = 1, 2, \ldots, m$, where $m$ is specified, the
$i$th step of the algorithm is

$$\Theta_i|\mathbf{x}, \gamma_{i-1} \sim g(\theta|\mathbf{x}, \gamma_{i-1})$$
$$\Gamma_i|\mathbf{x}, \theta_i \sim g(\gamma|\mathbf{x}, \theta_i).$$

Recall from our discussion in Section 11.3 that

$$\Theta_i \xrightarrow{D} k(\theta|\mathbf{x})$$
$$\Gamma_i \xrightarrow{D} g(\gamma|\mathbf{x}),$$

as $i \to \infty$. Furthermore, the arithmetic average

$$\frac{1}{m}\sum_{i=1}^{m}W(\Theta_i) \xrightarrow{P} E[W(\Theta)|\mathbf{x}] = \delta_W(\mathbf{x}) \quad \text{as } m \to \infty. \tag{11.4.4}$$

In practice, to obtain the Bayes estimate of $W(\theta)$ by the Gibbs sampler, we
generate by Monte Carlo the stream of values $(\theta_1, \gamma_1), (\theta_2, \gamma_2), \ldots$. Then choosing
large values of $m$ and $n^* > m$, our estimate of $W(\theta)$ is the average,

$$\frac{1}{n^* - m}\sum_{i=m+1}^{n^*}W(\theta_i). \tag{11.4.5}$$

Because of the Monte Carlo generation these procedures are often called MCMC,
for Markov Chain Monte Carlo procedures. We next provide two examples.


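Example 11.4.1. Reconsider the conjugate family of normal distributions discussed
in Example 11.1.3, with $\theta_0 = 0$. Here we use the model

$$\overline{X}|\Theta \sim N\left(\theta, \frac{\sigma^2}{n}\right), \quad \sigma^2 \text{ is known}$$
$$\Theta|\tau^2 \sim N(0, \tau^2)$$
$$\frac{1}{\tau^2} \sim \Gamma(a, b), \quad a \text{ and } b \text{ are known.} \tag{11.4.6}$$

To set up the Gibbs sampler for this hierarchical Bayes model, we need the conditional
pdfs $g(\theta|\overline{x}, \tau^2)$ and $g(\tau^2|\overline{x}, \theta)$. For the first, we have

$$g(\theta|\overline{x}, \tau^2) \propto f(\overline{x}|\theta)h(\theta|\tau^2)\psi(\tau^{-2}).$$

As we have been doing, we can ignore standardizing constants; hence, we need
only consider the product $f(\overline{x}|\theta)h(\theta|\tau^2)$. But this is a product of two normal pdfs
which we obtained in Example 11.1.3. Based on those results, $g(\theta|\overline{x}, \tau^2)$ is the pdf
of a $N\left(\left\{\tau^2/[(\sigma^2/n) + \tau^2]\right\}\overline{x},\ (\tau^2\sigma^2)/[\sigma^2 + n\tau^2]\right)$. For the second pdf, by ignoring
standardizing constants and simplifying, we obtain

$$g\left(\frac{1}{\tau^2}\,\Big|\,\overline{x}, \theta\right) \propto f(\overline{x}|\theta)g(\theta|\tau^2)\psi(1/\tau^2)$$
$$\propto \frac{1}{\tau}\exp\left\{-\frac{1}{2}\frac{\theta^2}{\tau^2}\right\}\left(\frac{1}{\tau^2}\right)^{a-1}\exp\left\{-\frac{1}{\tau^2}\frac{1}{b}\right\}$$
$$\propto \left(\frac{1}{\tau^2}\right)^{a+(1/2)-1}\exp\left\{-\frac{1}{\tau^2}\left[\frac{1}{2}\theta^2 + \frac{1}{b}\right]\right\},$$

which is the pdf of a $\Gamma\{a + (1/2), [(\theta^2/2) + (1/b)]^{-1}\}$ distribution. Thus the Gibbs
sampler for this model is given by:

$$\Theta_i|\overline{x}, \tau_{i-1}^2 \sim N\left(\frac{\tau_{i-1}^2}{(\sigma^2/n) + \tau_{i-1}^2}\,\overline{x},\ \frac{\tau_{i-1}^2\sigma^2}{\sigma^2 + n\tau_{i-1}^2}\right)$$
$$\frac{1}{\tau_i^2}\,\Big|\,\overline{x}, \theta_i \sim \Gamma\left(a + \frac{1}{2},\ \left(\frac{\theta_i^2}{2} + \frac{1}{b}\right)^{-1}\right), \tag{11.4.8}$$

for $i = 1, 2, \ldots, m$. As discussed above, for specified large values of $m$ and $n^* > m$,
we collect the chain's values $((\Theta_m, \tau_m), (\Theta_{m+1}, \tau_{m+1}), \ldots, (\Theta_{n^*}, \tau_{n^*}))$ and then
obtain the Bayes estimate of $\theta$ (assuming squared-error loss):

$$\hat{\theta} = \frac{1}{n^* - m}\sum_{i=m+1}^{n^*}\Theta_i. \tag{11.4.9}$$

The conditional distribution of $\Theta$ given $\overline{x}$ and $\tau_{i-1}$, though, suggests the second
estimate given by

$$\hat{\theta}^* = \frac{1}{n^* - m}\sum_{i=m+1}^{n^*}\frac{\tau_i^2}{\tau_i^2 + (\sigma^2/n)}\,\overline{x}. \tag{11.4.10}$$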
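The sampler (11.4.8) and the two estimates (11.4.9) and (11.4.10) can be sketched in R as follows. This is a minimal illustration under the text's convention that the second parameter of $\Gamma(\cdot, \cdot)$ is a scale; the function name `gibbs_hier_normal`, the starting value `tau2_0`, and the numerical inputs are hypothetical.

```r
# Gibbs sampler for the hierarchical normal model (11.4.6):
#   Theta_i | xbar, tau2_{i-1} ~ N(shrink * xbar, tau2 * sigma2 / (sigma2 + n * tau2))
#   1 / tau2_i | xbar, theta_i ~ Gamma(shape = a + 1/2, scale = 1 / (theta_i^2 / 2 + 1 / b))
gibbs_hier_normal <- function(xbar, n, sigma2, a, b, m, nstar, tau2_0 = 1) {
  theta <- numeric(nstar)
  tau2 <- numeric(nstar)
  tau2_prev <- tau2_0                       # starting value (assumed)
  for (i in seq_len(nstar)) {
    shrink <- tau2_prev / (sigma2 / n + tau2_prev)
    theta[i] <- rnorm(1, mean = shrink * xbar,
                      sd = sqrt(tau2_prev * sigma2 / (sigma2 + n * tau2_prev)))
    prec <- rgamma(1, shape = a + 0.5, scale = 1 / (theta[i]^2 / 2 + 1 / b))
    tau2[i] <- 1 / prec
    tau2_prev <- tau2[i]
  }
  keep <- (m + 1):nstar
  c(theta_hat = mean(theta[keep]),                                        # (11.4.9)
    theta_hat_star = mean(tau2[keep] / (tau2[keep] + sigma2 / n)) * xbar) # (11.4.10)
}

set.seed(7)
# Illustrative inputs only (not from the text).
est <- gibbs_hier_normal(xbar = 2, n = 20, sigma2 = 4, a = 1, b = 1,
                         m = 1000, nstar = 4000)
est
```

Both estimates shrink $\overline{x}$ toward the prior mean 0; the second averages conditional means rather than draws, so it typically shows less Monte Carlo variation.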


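Example 11.4.2. Lehmann and Casella (1998, p. 257) presented the following
hierarchical Bayes model:

$$X|\lambda \sim \text{Poisson}(\lambda)$$
$$\Lambda|b \sim \Gamma(1, b)$$
$$B \sim g(b) = \tau^{-1}b^{-2}\exp\{-1/b\tau\}, \quad b > 0,\ \tau > 0.$$

For the Gibbs sampler, we need the two conditional pdfs, $g(\lambda|x, b)$ and $g(b|x, \lambda)$.
The joint pdf is

$$g(x, \lambda, b) = f(x|\lambda)h(\lambda|b)\psi(b). \tag{11.4.11}$$

Based on the pdfs of the model, (11.4.11), for the first conditional pdf we have

$$g(\lambda|x, b) \propto e^{-\lambda}\frac{\lambda^x}{x!}\frac{1}{b}e^{-\lambda/b} \propto \lambda^{x+1-1}e^{-\lambda[1+(1/b)]}, \tag{11.4.12}$$

which is the pdf of a $\Gamma(x + 1, b/[b + 1])$ distribution.
For the second conditional pdf, we have

$$g(b|x, \lambda) \propto \frac{1}{b}e^{-\lambda/b}\,\tau^{-1}b^{-2}e^{-1/(b\tau)} \propto b^{-3}\exp\left\{-\frac{1}{b}\left[\frac{1}{\tau} + \lambda\right]\right\}.$$

In this last expression, making the change of variable $y = 1/b$ which has the Jacobian
$db/dy = -y^{-2}$, we obtain

$$g(y|x, \lambda) \propto y^3\exp\left\{-y\left[\frac{1}{\tau} + \lambda\right]\right\}y^{-2} \propto y^{2-1}\exp\left\{-y\left[\frac{1 + \lambda\tau}{\tau}\right]\right\},$$

which is easily seen to be the pdf of the $\Gamma(2, \tau/[\lambda\tau + 1])$ distribution. Therefore,
the Gibbs sampler is, for $i = 1, 2, \ldots, m$, where $m$ is specified,

$$\Lambda_i|x, b_{i-1} \sim \Gamma(x + 1, b_{i-1}/[1 + b_{i-1}])$$
$$B_i = Y_i^{-1}, \quad \text{where } Y_i|x, \lambda_i \sim \Gamma(2, \tau/[\lambda_i\tau + 1]). \ ■$$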
As a numerical illustration of the last example, suppose we set $\tau = 0.05$ and 
observe $x = 6$. The R function¹ hierarch1.s computes the Gibbs sampler given in 
the example. It requires specification of the value of $i$ at which the Gibbs sample 
commences and the length of the chain beyond this point. We set these values at 
$m = 1000$ and $n^* = 4000$, respectively; i.e., the length of the chain used in the 
estimate is 3000. To see the effect that varying $\tau$ has on the Bayes estimator, we 
performed five Gibbs samplers, with these results: 

$\tau$          | 0.040  0.045  0.050  0.055  0.060 
$\hat{\delta}$  | 6.600  6.490  6.530  6.500  6.440 

There is some variation. As discussed in Lehmann and Casella (1998), in general, 
there is less effect on the Bayes estimator due to variability of the hyperparameter 
than in regular Bayes due to the variance of the prior. 
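
For readers who wish to experiment, the sampler of Example 11.4.2 can be sketched in a few lines of R. This is not the function hierarch1.s; the name gibbs_pois_sketch and the starting value b_init are assumptions made only for this illustration.

```r
# Sketch of the Gibbs sampler of Example 11.4.2 (hypothetical helper, not hierarch1.s).
gibbs_pois_sketch <- function(x, tau, m, nstar, b_init = 1) {
  lambda <- numeric(nstar)
  b_prev <- b_init                       # assumed starting value for b
  for (i in 1:nstar) {
    # Lambda_i | x, b_{i-1} ~ Gamma(x + 1, scale = b_{i-1} / (1 + b_{i-1}))
    lambda[i] <- rgamma(1, shape = x + 1, scale = b_prev / (1 + b_prev))
    # Y_i | x, lambda_i ~ Gamma(2, scale = tau / (lambda_i * tau + 1)); B_i = 1 / Y_i
    y <- rgamma(1, shape = 2, scale = tau / (lambda[i] * tau + 1))
    b_prev <- 1 / y
  }
  mean(lambda[(m + 1):nstar])            # Bayes estimate under squared-error loss
}

set.seed(1)
est <- gibbs_pois_sketch(x = 6, tau = 0.05, m = 1000, nstar = 4000)
est
```

Up to Monte Carlo error, the result should be in the neighborhood of the tabled values; the exact figure depends on the seed and on the starting value.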


11.4.1 Empirical Bayes 


The empirical Bayes model consists of the first two lines of the hierarchical Bayes 
model; i.e., 


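\[
\begin{aligned}
\mathbf{X} \mid \theta &\sim f(\mathbf{x}|\theta) \\
\Theta \mid \gamma &\sim h(\theta|\gamma).
\end{aligned}
\]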


Instead of attempting to model the parameter $\gamma$ with a pdf as in hierarchical Bayes, 
empirical Bayes methodology estimates $\gamma$ based on the data as follows. Recall that 

\[
\begin{aligned}
g(\mathbf{x}, \theta|\gamma) &= \frac{g(\mathbf{x}, \theta, \gamma)}{\psi(\gamma)} \\
&= \frac{f(\mathbf{x}|\theta)h(\theta|\gamma)\psi(\gamma)}{\psi(\gamma)} \\
&= f(\mathbf{x}|\theta)h(\theta|\gamma).
\end{aligned}
\]

¹Downloadable at the site listed in the Preface. 

Consider, then, the likelihood function 
\[ m(\mathbf{x}|\gamma) = \int_{-\infty}^{\infty} f(\mathbf{x}|\theta)h(\theta|\gamma)\,d\theta. \tag{11.4.13} \]

Using the pdf $m(\mathbf{x}|\gamma)$, we obtain an estimate $\hat{\gamma} = \hat{\gamma}(\mathbf{x})$, usually by the method 
of maximum likelihood. For inference on the parameter $\theta$, the empirical Bayes 
procedure uses the posterior pdf $k(\theta|\mathbf{x},\hat{\gamma})$. 

We illustrate the empirical Bayes procedure with the following example. 


Example 11.4.3. Consider the same situation discussed in Example 11.4.2, except 
assume that we have a random sample on X; i.e., consider the model 


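\[
\begin{aligned}
X_i \mid \lambda,\ i = 1,2,\ldots,n &\sim \text{iid Poisson}(\lambda) \\
\Lambda \mid b &\sim \Gamma(1, b).
\end{aligned}
\]
Let $\mathbf{X} = (X_1, X_2, \ldots, X_n)'$. Hence, 
\[ g(\mathbf{x}|\lambda) = \frac{\lambda^{n\bar{x}}}{x_1!\cdots x_n!}e^{-n\lambda}, \]
where $\bar{x} = n^{-1}\sum_{i=1}^{n} x_i$. Thus, the pdf we need to maximize is 
\[
\begin{aligned}
m(\mathbf{x}|b) &= \int_0^\infty g(\mathbf{x}|\lambda)h(\lambda|b)\,d\lambda \\
&= \int_0^\infty \frac{1}{x_1!\cdots x_n!\,b}\lambda^{n\bar{x}+1-1}e^{-n\lambda}e^{-\lambda/b}\,d\lambda \\
&= \frac{\Gamma(n\bar{x}+1)[b/(nb+1)]^{n\bar{x}+1}}{x_1!\cdots x_n!\,b}.
\end{aligned}
\]

Taking the partial derivative of $\log m(\mathbf{x}|b)$ with respect to $b$, we obtain 

\[ \frac{\partial \log m(\mathbf{x}|b)}{\partial b} = -\frac{1}{b} + (n\bar{x}+1)\frac{1}{b(bn+1)}. \]
Setting this equal to 0 and solving for $b$, we obtain the solution 
\[ \hat{b} = \bar{x}. \tag{11.4.14} \]

To obtain the empirical Bayes estimate of $\lambda$, we need to compute the posterior pdf 
with $\hat{b}$ substituted for $b$. The posterior pdf is 

\[
\begin{aligned}
k(\lambda|\mathbf{x},\hat{b}) &\propto g(\mathbf{x}|\lambda)h(\lambda|\hat{b}) \\
&\propto \lambda^{n\bar{x}+1-1}e^{-\lambda[n+(1/\hat{b})]},
\end{aligned}
\tag{11.4.15}
\]
which is the pdf of a $\Gamma(n\bar{x}+1, \hat{b}/[n\hat{b}+1])$ distribution. Therefore, the empirical 
Bayes estimator under squared-error loss is the mean of this distribution; i.e., 

\[ \hat{\lambda} = [n\bar{x}+1]\frac{\hat{b}}{n\hat{b}+1} = \bar{x}, \tag{11.4.16} \]

since $\hat{b} = \bar{x}$. Thus, for the above prior, the empirical Bayes estimate agrees with 
the mle. ■

We can use our solution of this last example to obtain the empirical Bayes 
estimate for Example 11.4.2 also, for in this earlier example, the sample size is 1. 
Thus, the empirical Bayes estimate for $\lambda$ is $x$. In particular, for the numerical case 
given at the end of Example 11.4.2, the empirical Bayes estimate has the value 6. 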
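
The closed-form solution of Example 11.4.3 can be checked numerically. The sketch below maximizes $\log m(\mathbf{x}|b)$ with the R function optimize and then evaluates the posterior mean; the sample x and the search interval are hypothetical choices made only for this check.

```r
# Numerical check that b-hat = xbar maximizes log m(x|b) of Example 11.4.3.
# The data vector and the search interval are illustrative assumptions.
log_m <- function(b, x) {
  n <- length(x); s <- sum(x)            # s = n * xbar
  lgamma(s + 1) + (s + 1) * (log(b) - log(n * b + 1)) -
    sum(lfactorial(x)) - log(b)
}

x <- c(3, 7, 4, 6, 5)                    # hypothetical Poisson counts
opt <- optimize(log_m, interval = c(0.01, 50), x = x, maximum = TRUE)
b_hat <- opt$maximum
# Mean of the Gamma(n*xbar + 1, b_hat / (n*b_hat + 1)) posterior
lambda_hat <- (sum(x) + 1) * b_hat / (length(x) * b_hat + 1)
c(b_hat = b_hat, xbar = mean(x), lambda_hat = lambda_hat)
```

Up to the tolerance of the optimizer, all three values should coincide, in agreement with (11.4.14) and (11.4.16).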


EXERCISES 


11.4.1. Consider the Bayes model 


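\[
\begin{aligned}
X_i \mid \theta &\sim \text{iid } \Gamma\left(1, \frac{1}{\theta}\right) \\
\Theta \mid \beta &\sim \Gamma(2, \beta).
\end{aligned}
\]
By performing the following steps, obtain the empirical Bayes estimate of $\theta$. 

(a) Obtain the likelihood function 
\[ m(\mathbf{x}|\beta) = \int_0^\infty f(\mathbf{x}|\theta)h(\theta|\beta)\,d\theta. \]

(b) Obtain the mle $\hat{\beta}$ of $\beta$ for the likelihood $m(\mathbf{x}|\beta)$. 

(c) Show that the posterior distribution of $\Theta$ given $\mathbf{x}$ and $\hat{\beta}$ is a gamma distribution. 

(d) Assuming squared-error loss, obtain the empirical Bayes estimator. 
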
11.4.2. Consider the hierarchical Bayes model 


\[
\begin{aligned}
Y &\sim b(n, p), \quad 0 < p < 1 \\
p \mid \theta &\sim h(p|\theta) = \theta p^{\theta-1}, \quad \theta > 0 \\
\theta &\sim \Gamma(1, a), \quad a > 0 \text{ is specified.}
\end{aligned}
\tag{11.4.17}
\]
(a) Assuming squared-error loss, write the Bayes estimate of $p$ as in expression 
(11.4.3). Integrate relative to $\theta$ first. Show that both the numerator and 
denominator are expectations of a beta distribution with parameters $y + 1$ 
and $n - y + 1$. 


(b) Recall the discussion around expression (11.3.2). Write an explicit Monte 
Carlo algorithm to obtain the Bayes estimate in part (a). 


11.4.3. Reconsider the hierarchical Bayes model (11.4.17) of Exercise 11.4.2. 


(a) Show that the conditional pdf $g(p|y, \theta)$ is the pdf of a beta distribution with 
parameters $y + \theta$ and $n - y + 1$. 


(b) Show that the conditional pdf $g(\theta|y,p)$ is the pdf of a gamma distribution 
with parameters 2 and $\left[\frac{1}{a} - \log p\right]^{-1}$. 


(c) Using parts (a) and (b) and assuming squared-error loss, write the Gibbs 
sampler algorithm to obtain the Bayes estimator of $p$. 


11.4.4. For the hierarchical Bayes model of Exercise 11.4.2, set $n = 50$ and $a = 2$. 
Now, draw a $\theta$ at random from a $\Gamma(1,2)$ distribution and label it $\theta^*$. Next, draw a 
$p$ at random from the distribution with pdf $\theta^* p^{\theta^*-1}$ and label it $p^*$. Finally, draw 
a $y$ at random from a $b(n, p^*)$ distribution. 


(a) Setting $m$ at 3000, obtain an estimate of $\theta^*$ using your Monte Carlo algorithm 
of Exercise 11.4.2. 


(b) Setting $m$ at 3000 and $n^*$ at 6000, obtain an estimate of $\theta^*$ using your Gibbs 
sampler algorithm of Exercise 11.4.3. Let $p_{3001}, p_{3002}, \ldots, p_{6000}$ denote the 
stream of values drawn. Recall that these values are (asymptotically) simulated 
values from the posterior pdf $g(p|y)$. Use this stream of values to obtain 
a 95% credible interval. 


11.4.5. Write the Bayes model of Exercise 11.4.2 as 


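\[
\begin{aligned}
Y &\sim b(n, p), \quad 0 < p < 1 \\
p \mid \theta &\sim h(p|\theta) = \theta p^{\theta-1}, \quad \theta > 0.
\end{aligned}
\]

Set up the estimating equations for the mle of $g(y|\theta)$, i.e., the first step to obtain 
the empirical Bayes estimator of $p$. Simplify as much as possible. 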


11.4.6. Example 11.4.1 dealt with a hierarchical Bayes model for a conjugate family 
of normal distributions. Express that model as 


\[
\begin{aligned}
\bar{X} \mid \Theta &\sim N\left(\theta, \frac{\sigma^2}{n}\right), \quad \sigma^2 \text{ is known} \\
\Theta \mid \tau^2 &\sim N(0, \tau^2).
\end{aligned}
\]

Obtain the empirical Bayes estimator of $\theta$. 


Appendix A 


Mathematical Comments 


A.1 Regularity Conditions 


These are the regularity conditions referred to in Sections 6.4 and 6.5 of the text. 
A discussion of these conditions can be found in Chapter 6 of Lehmann and Casella 
(1998). 

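Let $\mathbf{X}$ have pdf $f(\mathbf{x};\boldsymbol{\theta})$, where $\boldsymbol{\theta} \in \Omega \subset R^p$. For these assumptions, $\mathbf{X}$ can be 
either a scalar random variable or a random vector in $R^k$. As in Section 6.4, let 
$\mathbf{I}(\boldsymbol{\theta}) = [I_{jk}]$ denote the $p \times p$ information matrix given by expression (6.4.4). Also, 
we will denote the true parameter $\boldsymbol{\theta}$ by $\boldsymbol{\theta}_0$. 


Assumptions A.1.1. Additional regularity conditions for Sections 6.4 and 6.5. 


(R6) There exists an open subset $\Omega_0 \subset \Omega$ such that $\boldsymbol{\theta}_0 \in \Omega_0$ and all third partial 
derivatives of $f(\mathbf{x};\boldsymbol{\theta})$ exist for all $\boldsymbol{\theta} \in \Omega_0$. 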


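(R7) The following equations are true (essentially, we can interchange expectation 
and differentiation): 

\[ E_{\boldsymbol{\theta}}\left[ \frac{\partial}{\partial\theta_j}\log f(\mathbf{x};\boldsymbol{\theta}) \right] = 0, \quad \text{for } j = 1,\ldots,p, \]
\[ I_{jk}(\boldsymbol{\theta}) = E_{\boldsymbol{\theta}}\left[ -\frac{\partial^2}{\partial\theta_j\partial\theta_k}\log f(\mathbf{x};\boldsymbol{\theta}) \right], \quad \text{for } j,k = 1,\ldots,p. \]

(R8) For all $\boldsymbol{\theta} \in \Omega_0$, $\mathbf{I}(\boldsymbol{\theta})$ is positive definite. 

(R9) There exist functions $M_{jkl}(\mathbf{x})$ such that 

\[ \left| \frac{\partial^3}{\partial\theta_j\partial\theta_k\partial\theta_l}\log f(\mathbf{x};\boldsymbol{\theta}) \right| \le M_{jkl}(\mathbf{x}), \quad \text{for all } \boldsymbol{\theta} \in \Omega_0, \]

and 
\[ E_{\boldsymbol{\theta}_0}[M_{jkl}] < \infty, \quad \text{for all } j,k,l \in 1,\ldots,p. \]
■


A.2 Sequences 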


The following is a short review of sequences of real numbers. In particular the 
liminf and limsup of sequences are discussed. As a supplement to this text, the 
authors offer a mathematical primer which can be downloaded at the site listed in 
the Preface. In addition to the following review of sequences, it contains a brief 
review of infinite series and of differential and integral calculus, including double 
integration. Students who need a review of these concepts can freely download this 
supplement. 

Let $\{a_n\}$ be a sequence of real numbers. Recall from calculus that $a_n \to a$ 
($\lim_{n\to\infty} a_n = a$) if and only if 

\[ \text{for every } \epsilon > 0, \text{ there exists an } N_0 \text{ such that } n \ge N_0 \implies |a_n - a| < \epsilon. \tag{A.2.1} \]

Let $A$ be a set of real numbers that is bounded from above; that is, there exists 
an $M \in R$ such that $x \le M$ for all $x \in A$. Recall that $a$ is the supremum of $A$ if 
$a$ is the least of all upper bounds of $A$. From calculus, we know that the supremum 
of a set bounded from above exists. Furthermore, we know that $a$ is the supremum 
of $A$ if and only if, for all $\epsilon > 0$, there exists an $x \in A$ such that $a - \epsilon < x \le a$. 
Similarly, we can define the infimum of $A$. 

We need three additional facts from calculus. The first is the Sandwich Theorem. 


Theorem A.2.1 (Sandwich Theorem). Suppose for sequences $\{a_n\}$, $\{b_n\}$, and 
$\{c_n\}$ that $c_n \le a_n \le b_n$, for all $n$, and that $\lim_{n\to\infty} b_n = \lim_{n\to\infty} c_n = a$. Then 
$\lim_{n\to\infty} a_n = a$. 


Proof: Let $\epsilon > 0$ be given. Because both $\{b_n\}$ and $\{c_n\}$ converge, we can choose $N_0$ 
so large that $|c_n - a| < \epsilon$ and $|b_n - a| < \epsilon$, for $n \ge N_0$. Because $c_n \le a_n \le b_n$, it is 
easy to see that 

\[ |a_n - a| \le \max\{|c_n - a|, |b_n - a|\}, \]

for all $n$. Hence, if $n \ge N_0$, then $|a_n - a| < \epsilon$. ■


The second fact concerns subsequences. Recall that $\{a_{n_k}\}$ is a subsequence of 
$\{a_n\}$ if the sequence $n_1 < n_2 < \cdots$ is an infinite subset of the positive integers. 
Note that $n_k \ge k$. 


Theorem A.2.2. The sequence $\{a_n\}$ converges to $a$ if and only if every subsequence 
$\{a_{n_k}\}$ converges to $a$. 


Proof: Suppose the sequence $\{a_n\}$ converges to $a$. Let $\{a_{n_k}\}$ be any subsequence. 
Let $\epsilon > 0$ be given. Then there exists an $N_0$ such that $|a_n - a| < \epsilon$, for $n \ge N_0$. 
For the subsequence, take $k'$ to be the first index of the subsequence beyond $N_0$. 
Because $n_k \ge k$ for all $k$, we have for $k \ge k'$ that $n_k \ge n_{k'} \ge k' \ge N_0$, which implies that 
$|a_{n_k} - a| < \epsilon$. Thus, $\{a_{n_k}\}$ converges to $a$. The converse is immediate because a 
sequence is also a subsequence of itself. ■


Finally, the third theorem concerns monotonic sequences. 

Theorem A.2.3. Let $\{a_n\}$ be a nondecreasing sequence of real numbers; i.e., for 
all $n$, $a_n \le a_{n+1}$. Suppose $\{a_n\}$ is bounded from above; i.e., for some $M \in R$, 
$a_n \le M$ for all $n$. Then the limit of $a_n$ exists. 


Proof: Let $a$ be the supremum of $\{a_n\}$. Let $\epsilon > 0$ be given. Then there exists an 
$N_0$ such that $a - \epsilon < a_{N_0} \le a$. Because the sequence is nondecreasing, this implies 
that $a - \epsilon < a_n \le a$, for all $n \ge N_0$. Hence, by definition, $a_n \to a$. ■

Let $\{a_n\}$ be a sequence of real numbers and define the two subsequences 

\[ b_n = \sup\{a_n, a_{n+1}, \ldots\}, \quad n = 1,2,3,\ldots, \tag{A.2.2} \]
\[ c_n = \inf\{a_n, a_{n+1}, \ldots\}, \quad n = 1,2,3,\ldots. \tag{A.2.3} \]

It is obvious that $\{b_n\}$ is a nonincreasing sequence. Hence, if $\{a_n\}$ is bounded from 
below, then the limit of $b_n$ exists. In this case, we call the limit of $\{b_n\}$ the limit 
supremum (limsup) of the sequence $\{a_n\}$ and write it as 

\[ \overline{\lim}_{n\to\infty}\, a_n = \lim_{n\to\infty} b_n. \tag{A.2.4} \]

Note that if $\{a_n\}$ is not bounded from below, then $\overline{\lim}_{n\to\infty} a_n = -\infty$. Also, if 
$\{a_n\}$ is not bounded from above, we define $\overline{\lim}_{n\to\infty} a_n = \infty$. Hence, the $\overline{\lim}$ of any 
sequence always exists. Also, from the definition of the subsequence $\{b_n\}$, we have 

\[ a_n \le b_n, \quad n = 1,2,3,\ldots. \tag{A.2.5} \]

On the other hand, $\{c_n\}$ is a nondecreasing sequence. Hence, if $\{a_n\}$ is bounded 
from above, then the limit of $c_n$ exists. We call the limit of $\{c_n\}$ the limit infimum 
(liminf) of the sequence $\{a_n\}$ and write it as 

\[ \underline{\lim}_{n\to\infty}\, a_n = \lim_{n\to\infty} c_n. \tag{A.2.6} \]

Note that if $\{a_n\}$ is not bounded from above, then $\underline{\lim}_{n\to\infty} a_n = \infty$. Also, if $\{a_n\}$ is 
not bounded from below, $\underline{\lim}_{n\to\infty} a_n = -\infty$. Hence, the $\underline{\lim}$ of any sequence always 
exists. Also, from the definition of the subsequences $\{c_n\}$ and $\{b_n\}$, we have 

\[ c_n \le a_n \le b_n, \quad n = 1,2,3,\ldots. \tag{A.2.7} \]

Also, because $c_n \le b_n$ for all $n$, we have 

\[ \underline{\lim}_{n\to\infty}\, a_n \le \overline{\lim}_{n\to\infty}\, a_n. \tag{A.2.8} \]


Example A.2.1. Here are two examples. More are given in the exercises. 


1. Suppose $a_n = -n$ for all $n = 1,2,\ldots$. Then $b_n = \sup\{-n, -n-1, \ldots\} = 
-n \to -\infty$ and $c_n = \inf\{-n, -n-1, \ldots\} = -\infty \to -\infty$. So, $\underline{\lim}_{n\to\infty} a_n = 
\overline{\lim}_{n\to\infty} a_n = -\infty$. 

2. Suppose $\{a_n\}$ is defined by 

\[ a_n = \begin{cases} 1 + \frac{1}{n} & \text{if } n \text{ is even} \\ 2 + \frac{1}{n} & \text{if } n \text{ is odd.} \end{cases} \]

Then $\{b_n\}$ is the sequence $\{3, 2+(1/3), 2+(1/3), 2+(1/5), 2+(1/5), \ldots\}$, which 
converges to 2, while $\{c_n\} \equiv 1$, which converges to 1. Thus, $\underline{\lim}_{n\to\infty} a_n = 1$ 
and $\overline{\lim}_{n\to\infty} a_n = 2$. 
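
The second sequence of Example A.2.1 can be explored numerically. The sketch below approximates $b_n$ and $c_n$ by taking the sup and inf over a finite tail $a_n, \ldots, a_N$; the truncation point N is an assumption, so the computed $c_n$ only approaches the true value 1 as N grows.

```r
# Finite-tail approximation of b_n = sup{a_n, a_{n+1}, ...} and c_n = inf{...}
# for the second sequence of Example A.2.1. N is an arbitrary truncation point.
a <- function(n) ifelse(n %% 2 == 0, 1 + 1/n, 2 + 1/n)
N  <- 2000
an <- a(1:N)
bn <- rev(cummax(rev(an)))   # running max taken from the right: sup of the tail
cn <- rev(cummin(rev(an)))   # running min taken from the right: inf of the tail
bn[1:5]                      # compare with 3, 2+1/3, 2+1/3, 2+1/5, 2+1/5
cn[c(1, 10, 100)]            # close to 1 (equal to 1 + 1/N for this truncation)
```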


It is useful that the $\underline{\lim}_{n\to\infty}$ and $\overline{\lim}_{n\to\infty}$ of every sequence exist. Also, the 
sandwich effects of expressions (A.2.7) and (A.2.8) lead to the following theorem. 


Theorem A.2.4. Let $\{a_n\}$ be a sequence of real numbers. Then the limit of 
$\{a_n\}$ exists if and only if $\underline{\lim}_{n\to\infty} a_n = \overline{\lim}_{n\to\infty} a_n$, in which case, $\lim_{n\to\infty} a_n = 
\underline{\lim}_{n\to\infty} a_n = \overline{\lim}_{n\to\infty} a_n$. 


Proof: Suppose first that $\lim_{n\to\infty} a_n = a$. Because the sequences $\{c_n\}$ and $\{b_n\}$ 
are subsequences of $\{a_n\}$, Theorem A.2.2 implies that they converge to $a$ also. 
Conversely, if $\underline{\lim}_{n\to\infty} a_n = \overline{\lim}_{n\to\infty} a_n$, then expression (A.2.7) and the Sandwich 
Theorem, A.2.1, imply the result. ■


Based on this last theorem, we have two interesting applications that are 
frequently used in statistics and probability. Let $\{p_n\}$ be a sequence of probabilities 
and let $b_n = \sup\{p_n, p_{n+1}, \ldots\}$ and $c_n = \inf\{p_n, p_{n+1}, \ldots\}$. For the first application, 
suppose we can show that $\overline{\lim}_{n\to\infty} p_n = 0$. Then, because $0 \le p_n \le b_n$, the Sandwich 
Theorem implies that $\lim_{n\to\infty} p_n = 0$. For the second application, suppose we 
can show that $\underline{\lim}_{n\to\infty} p_n = 1$. Then, because $c_n \le p_n \le 1$, the Sandwich Theorem 
implies that $\lim_{n\to\infty} p_n = 1$. 

We list some other properties in a theorem and ask the reader to provide the 
proofs in Exercise A.2.2: 


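Theorem A.2.5. Let $\{a_n\}$ and $\{d_n\}$ be sequences of real numbers. Then 

\[ \overline{\lim}_{n\to\infty}(a_n + d_n) \le \overline{\lim}_{n\to\infty} a_n + \overline{\lim}_{n\to\infty} d_n, \tag{A.2.9} \]
\[ \underline{\lim}_{n\to\infty} a_n = -\overline{\lim}_{n\to\infty}(-a_n). \tag{A.2.10} \]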


EXERCISES 


A.2.1. Calculate the $\underline{\lim}$ and $\overline{\lim}$ of each of the following sequences: 

(a) For $n = 1,2,\ldots$, $a_n = (-1)^n\left(2 - \frac{4}{2^n}\right)$. 

(b) For $n = 1,2,\ldots$, $a_n = n^{\cos(\pi n/2)}$. 

(c) For $n = 1,2,\ldots$, $a_n = \frac{1}{n} + \cos\frac{\pi n}{2} + (-1)^n$. 

A.2.2. Prove properties (A.2.9) and (A.2.10). 

A.2.3. Let $\{a_n\}$ and $\{d_n\}$ be sequences of real numbers. Show that 

\[ \underline{\lim}_{n\to\infty}(a_n + d_n) \ge \underline{\lim}_{n\to\infty} a_n + \underline{\lim}_{n\to\infty} d_n. \]

A.2.4. Let $\{a_n\}$ be a sequence of real numbers. Suppose $\{a_{n_k}\}$ is a subsequence of 
$\{a_n\}$. If $\{a_{n_k}\} \to a_0$ as $k \to \infty$, show that $\underline{\lim}_{n\to\infty} a_n \le a_0 \le \overline{\lim}_{n\to\infty} a_n$. 


Appendix B 


R Primer 


The package R can be downloaded at CRAN (https://cran.r-project.org/). It is 
freeware and there are versions for most platforms including Windows, Mac, and 
Linux. To install R simply follow the directions at CRAN. Installation should only 
take a few minutes. For more information on R, there are free downloadable manuals 
on its use at the CRAN website. There are many reference texts that the reader 
can consult, including the books by Venables and Ripley (2002), Verzani (2014), 
Crawley (2007), and Chapter 1 of Kloke and McKean (2014). 

Once R is installed, in Windows, click on the R icon to begin an R session. The 
R prompt is a >. To exit R, type q(), which results in the query Save workspace 
image? [y/n/c]:. Upon typing y, the workspace will be saved for the next session. 
R has a built-in help (documentation) system. For example, to obtain help on the 
mean function, simply type help(mean). To exit help, type q. We recommend 
using R while working through the sections in this primer. 


B.1 Basics 


The commands of R work on numerical data, character strings, or logical types. To 
separate commands on the same line, use semicolons. Also, anything to the right of 
the symbol # is disregarded by R; i.e., to the right of # can be used for comments. 
Here are some arithmetic calculations: 


> 8+6 - 7*2 
[1] 0 
> (150/3) + 7^2 - 1 ; sqrt(50) - 50^(1/2) 
[1] 98 
[1] 0 
> (4/3)*pi*5^3 # The volume of a sphere with radius 5 
[1] 523.5988 
> 2*pi*5 # The circumference of a sphere with radius 5 
[1] 31.41593 


Results can be saved for later calculation by either the assignment function <- or 
equivalently the equal symbol =. Names can be a mixture of letters, numbers, or 
symbols. For example: 


> r <- 10 ; Vol <- (4/3)*pi*r^3 ; Vol 
[1] 4188.79 
> r = 100 ; circum = 2*pi*r ; circum 
[1] 628.3185 


Variables in R include scalars, vectors, or matrices. In the last example the variables 
r and Vol are scalars. Scalars can be combined into vectors with the c function. Fur- 
ther, arithmetic functions on vectors are performed componentwise. For instance, 
here are two ways to compute the volumes of spheres with radii 5,6,...,9. 


> r <- c(5,6,7,8,9) ; Vol <- (4/3)*pi*r^3 ; Vol 
[1]  523.5988  904.7787 1436.7550 2144.6606 3053.6281 
> r <- 5:9 ; Vol <- (4/3)*pi*r^3 ; Vol 
[1]  523.5988  904.7787 1436.7550 2144.6606 3053.6281 


Components of a vector are referred to by using brackets. For example, the 5th 
component of the vector vec is vec[5]. Matrices can be formed from vectors using 
the commands rbind (combine rows) and cbind (combine columns) on vectors. To 
illustrate let A and B be the matrices 


\[ A = \begin{bmatrix} 1 & 4 \\ 3 & 2 \end{bmatrix} \quad \text{and} \quad B = \begin{bmatrix} 1 & 3 & 5 & 7 \\ 2 & 4 & 6 & 8 \end{bmatrix}. \]
Then $AB$, $A^{-1}$, and $B'A$ are computed by 
> c1 <- c(1,3) ; c2 <- c(4,2); a <- cbind(c1,c2) 
> r1 <- c(1,3,5,7); r2 <- c(2,4,6,8); b <- rbind(r1,r2) 
> a%*%b; solve(a) ; t(b)%*%a 
     [,1] [,2] [,3] [,4] 
[1,]    9   19   29   39 
[2,]    7   17   27   37 
   [,1] [,2] 
c1 -0.2  0.4 
c2  0.3 -0.1 
     c1 c2 
[1,]  7  8 
[2,] 15 20 
[3,] 23 32 
[4,] 31 44 


Brackets are also used to refer to elements of matrices. Let amat be a 4 x 4 matrix. 
Then the (2,3) element is amat[2,3] and the upper right corner 2 x 2 submatrix is 
amat[1:2,3:4]. This last item is an example of subsetting of a matrix. Subsetting 
is easy in R. For example, the following commands obtain the negative, positive, 
and zero elements of a vector x: 


> x = c(-2,0,3,4,-7,-8,11,0); xn = x[x<0]; xn 
[1] -2 -7 -8 
> xp = x[x>0]; xp 
[1]  3  4 11 
> x0 = x[x==0]; x0 
[1] 0 0 
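
Matrix subsetting follows the same pattern. Here is a small sketch using a hypothetical 4 x 4 matrix amat; its entries, filled column by column with the integers 1 to 16, are chosen only for illustration.

```r
# Subsetting a hypothetical 4 x 4 matrix (filled column by column with 1:16)
amat <- matrix(1:16, nrow = 4, ncol = 4)
elem <- amat[2, 3]                 # the (2,3) element
sub  <- amat[1:2, 3:4]             # upper right corner 2 x 2 submatrix
rows <- amat[amat[, 1] > 2, ]      # rows whose first-column entry exceeds 2
elem; sub; rows
```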


For R vectors x and y of the same length, the plot of y versus x is obtained by 
the command plot(y ~ x). The following segment of R code obtains plots found 
in Figure 2.1.1 of the volume and circumference of the sphere versus the radius 
for a sequence of radii from 0 to 8 in steps of 0.1. The first plot is a simple plot; 
the second plot adds some labeling and a title; the third plot draws a curve of the 
relationship; and the fourth plot shows the relationship between the circumference 
of the circle versus the radius. 


par(mfrow=c(2,2)) # This sets up a 2 by 2 page of plots 
r <- seq(0,8,.1); Vol <- (4/3)*pi*r^3 ; plot(Vol ~ r) # Plot 1 
title("Simple Plot") 
plot(Vol ~ r,xlab="Radius",ylab="Volume") # Plot 2 
title("Volume vs Radius") 
plot(Vol ~ r,pch=" ",xlab="Radius",ylab="Volume") 
lines(Vol ~ r) # Plot 3 
title("Curve") 
circum <- 2*pi*r 
plot(circum ~ r,pch=" ",xlab="Radius",ylab="Circumference") 
lines(circum ~ r); title("Circumference vs Radius") # Plot 4 


Figure 2.1.1: Spherical Plots discussed in Text. (Four panels: Simple Plot; Volume vs Radius; Curve; Circumference vs Radius.) 


B.2 Probability Distributions 


For many distributions, R has functions that obtain probabilities, compute quan- 
tiles, and generate random variates. Here are two common examples. Let X be a 
random variable with a $N(\mu, \sigma^2)$ distribution. In R, let mu and sig denote the mean 
and standard deviation of X, respectively. Then the R commands and meanings 
are: 


pnorm(x,mu,sig) | $P(X \le x)$. 
qnorm(p,mu,sig) | $P(X \le q) = p$. 
dnorm(x,mu,sig) | $f(x)$, where $f$ is the pdf of X. 
rnorm(n,mu,sig) | n variates generated from distribution of X. 


As a numerical illustration, suppose the height of a male is normally distributed 
with mean 70 inches and standard deviation 4 inches. 


> 1-pnorm(72,70,4) # Prob. man exceeds 6 foot in ht. 
[1] 0.3085375 
> qnorm(.90,70,4) # The upper 10th percentile in ht. 
[1] 75.12621 
> dnorm(72,70,4) # value of density at 72 
[1] 0.08801633 
> rnorm(6,70,4) # sample of size 6 on X 
[1] 72.12486 75.25811 71.26661 63.36465 74.19436 69.71513 


For the next figure, 2.2.2, we generate 100 variates, histogram the sample, and 
overlay the plot of the density of X on the histogram. Note the pr=T argument in 
the histogram. This scales the histogram to have area 1. 


> x = rnorm(100,70,4); x=sort(x) 
> hist(x,pr=T,main="Histogram of Sample") 
> y = dnorm(x,70,4) 
> lines(y~x) 


Figure 2.2.2: Histogram of a Random Sample from a $N(70, 4^2)$ distribution 
overlaid with the pdf of this normal. (Panel title: Histogram of Sample.) 


For a discrete random variable the pdf is the probability mass function (pmf). 
Suppose X is binomial with 100 trials and 0.6 as the probability of success. 
> pbinom(55,100,.6) # Probability of at most 55 successes 
[1] 0.1789016 
> dbinom(55,100,.6) # Probability of exactly 55 successes 
[1] 0.04781118 


Most other well known distributions are in core R. For example, here is the 
probability that a $\chi^2$ random variable with 30 degrees of freedom exceeds 2 standard 
deviations from its mean, along with a $\Gamma$-distribution confirmation. 


> mu=30; sig=sqrt(2*mu); 1-pchisq(mu+2*sig,30) 
[1] 0.03471794 
> 1-pgamma(mu+2*sig,15,1/2) 
[1] 0.03471794 
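
The same upper-tail probability can be written with the lower.tail argument and with the gamma scale parameter named explicitly, which makes the relation $\chi^2(30) = \Gamma(15, 2)$ easier to read. The variable names p_chisq and p_gamma below are ours.

```r
# Upper-tail probability computed two ways; p_chisq and p_gamma are our names.
mu <- 30; sig <- sqrt(2 * mu)
p_chisq <- pchisq(mu + 2 * sig, df = 30, lower.tail = FALSE)
p_gamma <- pgamma(mu + 2 * sig, shape = 15, scale = 2, lower.tail = FALSE)
c(p_chisq, p_gamma)
```

Note that in the positional call pgamma(q, 15, 1/2) the third argument is the rate, i.e., the reciprocal of the scale 2.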


The sample command returns a random sample from a vector. It can ei- 
ther be sampling with replacement (replace=T) or sampling without replacement 
(replace=F). Here are samples of size 12 from the first 20 positive integers. 


> vec = 1:20 
> sample(vec,12,replace=T) 


[1] 14 20  7 17  6  6 11 11  9  1 10 14 
> sample(vec,12,replace=F) 
[1] 12  1 14  5  4 11  3 17 16 19 20 15 
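
Because sample draws at random, repeated calls give different results. If a reproducible draw is wanted, the random number generator can be seeded first with set.seed; the seed value below is arbitrary.

```r
# Reproducible sampling: the same seed yields the same draw (seed is arbitrary)
vec <- 1:20
set.seed(20)
s1 <- sample(vec, 12, replace = FALSE)
set.seed(20)
s2 <- sample(vec, 12, replace = FALSE)
identical(s1, s2)          # same seed, same sample
anyDuplicated(s1) == 0     # no repeats when sampling without replacement
```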


B.3 R Functions 


The syntax for R functions is the same as the syntax in R. This easily allows for 
the development of packages, a collection of R functions, for specific tasks. The 
schematic for an R function is 


name-function <- function(arguments) { 
body of function 
} 


Example B.3.1. Consider a process where a measurement is taken over time. At 
each time $n$, $n = 1,2,\ldots$, the measurement $x_n$ is observed but only the sample 
mean $\bar{x}_n = (1/n)\sum_{i=1}^{n} x_i$ of the measurements at time $n$ is recorded and the point 
$(n, \bar{x}_n)$ is added to the running plot of sample means. How is this possible? There 
is a simple update formula for the sample mean that is easily derived. It is given by 

\[ \bar{x}_{n+1} = \frac{n}{n+1}\bar{x}_n + \frac{1}{n+1}x_{n+1}; \tag{B.3.1} \]

hence, the sample mean for the sequence $x_1, \ldots, x_{n+1}$ can be expressed as a linear 
combination of the sample mean at time $n$ and the $(n+1)$st measurement. The 
following R function codes this update formula: 


mnupdate <- function(n,xbarn,xnp1){ 
# Input: n is sample size; xbarn is mean of sample of size n; 
#        xnp1 is (n+1)st (new) observation 
# Output: mean of sample of size (n+1) 
    mnupdate <- (n/(n+1))*xbarn + xnp1/(n+1) 
    return(mnupdate) 
} 


To run this function we first source it in R. If the function is in the file mnupdate.R 
in the current directory then the source command is source("mnupdate.R"). It 
can also be copied and pasted into the current R session. Here is an execution of it: 


> source("mnupdate.R") 
> x = c(3,5,12,4); n=4; xbarn = mean(x); 
> x; xbarn # Old sample and its mean 
[1]  3  5 12  4 
[1] 6 
> xp1 = 30 # New observation 
> mnupdate(n,xbarn,xp1) # Mean of updated sample 
[1] 10.8 
■
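
To see the update formula at work over a whole sequence, the sketch below applies it repeatedly and compares the result with the running means computed directly. The function is restated in compact form so that the block is self-contained; the data vector extends the one above with the new observation.

```r
# Running means via the update formula (B.3.1), checked against direct computation
mnupdate <- function(n, xbarn, xnp1) (n / (n + 1)) * xbarn + xnp1 / (n + 1)

x <- c(3, 5, 12, 4, 30)
running <- numeric(length(x))
running[1] <- x[1]                       # mean of a sample of size 1
for (n in 1:(length(x) - 1)) {
  running[n + 1] <- mnupdate(n, running[n], x[n + 1])
}
running
all.equal(running, cumsum(x) / seq_along(x))
```

Only the current mean and the current sample size need to be stored at each step, which is the point of the example.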
B.4 Loops 


Occasionally in the text, we use a loop in an R program to compute a result. Usually 
it is a simple for loop of the form 


for(i in 1:n){ 
    # ... R code, often as a function of i ... 
    # For the n-iterations of the loop, i runs through 
    # the values i=1, i=2, ... , i=n. 
} 


For example, the following code segment produces a table of squares, cubes, square- 
roots, and cube-roots, for the integers from 1 to n. 


# set n at some value 
tab <- c() # Initialize the table 
for(i in 1:n){ 
    tab <- rbind(tab,c(i,i^2,i^3,i^(1/2),i^(1/3))) 
} 
tab 
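
Since arithmetic on R vectors is componentwise, the same table can also be built without a loop. The following sketch, with n set to 5 purely for illustration, constructs it both ways and confirms that the two agree.

```r
# Table of powers and roots: vectorized version versus the loop version
n <- 5                                   # illustrative value
i <- 1:n
tab_vec <- cbind(i, i^2, i^3, i^(1/2), i^(1/3))
tab_loop <- c()
for (k in 1:n) {
  tab_loop <- rbind(tab_loop, c(k, k^2, k^3, k^(1/2), k^(1/3)))
}
all.equal(unname(tab_vec), unname(tab_loop))
```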


B.5 Input and Output 


Many texts on R, including the references cited above, have information on input 
and output (I/O) in R. We only discuss several ways which are useful for the R 
discussion in our text. For output, we discuss two commands. The first writes an 
array (matrix) to a text file. Suppose amat is a matrix with p columns. Then the 
command write(t(amat),ncol=p,file="amatrix.dat") writes the matrix amat 
to the text file amatrix.dat in the current directory. Simply put the “Path” before 
the file as file="Path/amatrix.dat" to send it to another directory. The second 
way writes out variables to an R object file called an “rda” file. The variables can 
include scalars, vectors, matrices, and strings. For example the next line of code 
writes to an rda file the scalars avar and bscale and the matrix amat along with 
an information string. 


info <- "This file contains the variable ....." 
save(avar,bscale,amat,info,file="try.rda") 


The command load("try.rda") will load these variables (names and values) into 
the current session. Most of the data sets discussed in the text are in rda files. 

For input, we have already discussed the c and load functions. The c function 
is tedious, though, and a much easier way is to use the scan function. For example, 
the following lines of code assign the vector (1, 2,3) to x: 


x <- scan() 
1 2 
3 


The separator between values is white space and the empty line after the data 
signals the end of x’s values. Note that this allows data to be copied and pasted 
into R. A matrix can also be scanned similarly by using the read.table function; 
for example, the following command inputs the above matrix A with column header 
“c1” and “c2”: 


a <- read.table(header = TRUE, text = " 
c1 c2 
1 4 
3 2 
") 


Notice that copy and paste is also easily used with this command. If the matrix A 
is in the file amat.dat with no header, it can be read in as 


a <- matrix(scan("amat.dat"),ncol=2,byrow=T) 
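
The write and scan commands can be checked together with a round trip. The sketch below writes the matrix A to a temporary file and reads it back; the use of tempfile, and the full argument name ncolumns in place of the abbreviation ncol, are our choices for this illustration.

```r
# Round trip: write a matrix to a text file, then read it back with scan
amat <- rbind(c(1, 4), c(3, 2))
f <- tempfile(fileext = ".dat")              # temporary file, our choice
write(t(amat), ncolumns = 2, file = f)       # t() so that rows are written as rows
a <- matrix(scan(f, quiet = TRUE), ncol = 2, byrow = TRUE)
same <- identical(a, amat)
unlink(f)                                    # remove the temporary file
same
```

The transpose is needed because write outputs the values of a matrix column by column.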


B.6 Packages 


An R package is a collection of R functions designed for specified tasks. For example, 
in Chapter 10, the packages Rfit and npsm are discussed that compute rank-based 
robust and nonparametric procedures. There are thousands of free packages 
available to users at the site CRAN. The package hmcpkg contains all the R functions 
and R data sets discussed in this text. It can be downloaded at the site: 
http://www.stat.wmich.edu/mckean/hmchomepage/Pkg/ 

Once it is installed on your computer use the library command as shown next to use 
the package in an R session. The next segment of code prints out the first 3 lines 
of the baseball data set discussed in Example 4.2.4. The attach command allows us 
to access the variables of the data set, as we show for the variable height. 


library(hmcpkg) 
head(bb,3) 
  hand height weight hitind hitpitind average 
1    1     74    218      1         0   3.330 
2    0     75    185      1         1   0.286 
3    1     77    219      2         0   3.040 
attach(bb); head(height,4) # accessing the variable height 
[1] 74 75 77 73 

In Example 1.3.3, the derivation of the probability that in a group of n people at 
least 2 have the same birthday is given. The R function bday, included in the 
package, computes this probability. The following segment of code computes it for 
a group of size 10. 

library(hmcpkg) 
bday(10) 

[1] 0.1169482 
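
For readers without the package at hand, the same probability can be computed directly. The sketch below is not the package's bday function; bday_sketch is a hypothetical name, and the code assumes 365 equally likely birthdays, with the probability computed as one minus the probability that all n birthdays differ.

```r
# Probability that at least 2 of n people share a birthday
# (hypothetical helper; assumes 365 equally likely days)
bday_sketch <- function(n) 1 - prod((365 - 0:(n - 1)) / 365)
p10 <- bday_sketch(10)
p10
```

For a group of size 10 this agrees with the value returned by bday(10) above.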


Appendix C 


Lists of Common Distributions 


In this appendix, we provide a short list of common distributions. For each 
distribution, we note the expression where the pmf or pdf is defined in the text, the formula 
for the pmf or pdf, its mean and variance, and its mgf. The first list contains 
common discrete distributions, and the second list contains common continuous 
distributions. 


List of Common Discrete Distributions 


Bernoulli  (3.1.1) 
    $0 < p < 1$ 
    $p(x) = p^x(1-p)^{1-x}$, $x = 0, 1$ 
    $\mu = p$, $\sigma^2 = p(1-p)$ 
    $m(t) = [(1-p) + pe^t]$, $-\infty < t < \infty$ 

Binomial  (3.1.2) 
    $0 < p < 1$, $n = 1, 2, \ldots$ 
    $p(x) = \binom{n}{x}p^x(1-p)^{n-x}$, $x = 0, 1, 2, \ldots, n$ 
    $\mu = np$, $\sigma^2 = np(1-p)$ 
    $m(t) = [(1-p) + pe^t]^n$, $-\infty < t < \infty$ 

Geometric  (3.1.4) 
    $0 < p < 1$ 
    $p(x) = p(1-p)^x$, $x = 0, 1, 2, \ldots$ 
    $\mu = \frac{1-p}{p}$, $\sigma^2 = \frac{1-p}{p^2}$ 
    $m(t) = p[1 - (1-p)e^t]^{-1}$, $t < -\log(1-p)$ 

Hypergeometric $(N, D, n)$  (3.1.7) 
    $n = 1, 2, \ldots, \min\{N, D\}$ 
    $p(x) = \frac{\binom{N-D}{n-x}\binom{D}{x}}{\binom{N}{n}}$, $x = 0, 1, 2, \ldots, n$ 
    $\mu = n\frac{D}{N}$, $\sigma^2 = n\frac{D}{N}\frac{N-D}{N}\frac{N-n}{N-1}$ 
    The above pmf is the probability of obtaining $x$ Ds 
    in a sample of size $n$, without replacement. 

Negative Binomial  (3.1.3) 
    $0 < p < 1$, $r = 1, 2, \ldots$ 
    $p(x) = \binom{x+r-1}{r-1}p^r(1-p)^x$, $x = 0, 1, 2, \ldots$ 
    $\mu = \frac{r(1-p)}{p}$, $\sigma^2 = \frac{r(1-p)}{p^2}$ 
    $m(t) = p^r[1 - (1-p)e^t]^{-r}$, $t < -\log(1-p)$ 

Poisson  (3.2.1) 
    $m > 0$ 
    $p(x) = e^{-m}\frac{m^x}{x!}$, $x = 0, 1, 2, \ldots$ 
    $\mu = m$, $\sigma^2 = m$ 
    $m(t) = \exp\{m(e^t - 1)\}$, $-\infty < t < \infty$ 
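
The moments of the discrete distributions are easy to confirm numerically in R. As an illustration, the sketch below checks the negative binomial mean $r(1-p)/p$ and variance $r(1-p)/p^2$ by summing over a truncated support; the values of r and p and the truncation point are arbitrary choices, and the neglected tail mass is negligible for them.

```r
# Numerical check of the negative binomial mean and variance.
# r, p, and the truncation point 2000 are illustrative assumptions.
r <- 3; p <- 0.4
x  <- 0:2000
px <- dnbinom(x, size = r, prob = p)   # pmf: choose(x+r-1, r-1) p^r (1-p)^x
mu_num <- sum(x * px)
s2_num <- sum((x - mu_num)^2 * px)
c(mu_num, r * (1 - p) / p)             # numerical mean versus formula
c(s2_num, r * (1 - p) / p^2)           # numerical variance versus formula
```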


List of Common Continuous Distributions 


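beta  (3.3.9) 
    $\alpha > 0$, $\beta > 0$ 
    $f(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}$, $0 < x < 1$ 
    $\mu = \frac{\alpha}{\alpha+\beta}$, $\sigma^2 = \frac{\alpha\beta}{(\alpha+\beta+1)(\alpha+\beta)^2}$ 
    $m(t) = 1 + \sum_{k=1}^{\infty}\left(\prod_{j=0}^{k-1}\frac{\alpha+j}{\alpha+\beta+j}\right)\frac{t^k}{k!}$, $-\infty < t < \infty$ 

Cauchy  (1.9.2) 
    $f(x) = \frac{1}{\pi}\frac{1}{x^2+1}$, $-\infty < x < \infty$ 
    Neither the mean nor the variance exists. 
    The mgf does not exist. 

Chi-squared, $\chi^2(r)$  (3.3.7) 
    $r > 0$ 
    $f(x) = \frac{1}{\Gamma(r/2)2^{r/2}}x^{(r/2)-1}e^{-x/2}$, $x > 0$ 
    $\mu = r$, $\sigma^2 = 2r$ 
    $m(t) = (1-2t)^{-r/2}$, $t < \frac{1}{2}$ 
    $\chi^2(r) \Leftrightarrow \Gamma(r/2, 2)$ 
    $r$ is called the degrees of freedom. 

Exponential  (3.3.6) 
    $\lambda > 0$ 
    $f(x) = \lambda e^{-\lambda x}$, $x > 0$ 
    $\mu = \frac{1}{\lambda}$, $\sigma^2 = \frac{1}{\lambda^2}$ 
    $m(t) = [1 - (t/\lambda)]^{-1}$, $t < \lambda$ 

F, $F(r_1, r_2)$  (3.6.6) 
    $r_1 > 0$, $r_2 > 0$ 
    $f(x) = \frac{\Gamma[(r_1+r_2)/2](r_1/r_2)^{r_1/2}}{\Gamma(r_1/2)\Gamma(r_2/2)}\frac{x^{r_1/2-1}}{(1+r_1x/r_2)^{(r_1+r_2)/2}}$, $x > 0$ 
    If $r_2 > 2$, $\mu = \frac{r_2}{r_2-2}$. If $r_2 > 4$, $\sigma^2 = 2\left(\frac{r_2}{r_2-2}\right)^2\frac{r_1+r_2-2}{r_1(r_2-4)}$. 
    The mgf does not exist. 
    $r_1$ is called the numerator degrees of freedom. 
    $r_2$ is called the denominator degrees of freedom. 

Gamma, $\Gamma(\alpha, \beta)$  (3.3.2) 
    $\alpha > 0$, $\beta > 0$ 
    $f(x) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}}x^{\alpha-1}e^{-x/\beta}$, $x > 0$ 
    $\mu = \alpha\beta$, $\sigma^2 = \alpha\beta^2$ 
    $m(t) = (1-\beta t)^{-\alpha}$, $t < \frac{1}{\beta}$ 

Laplace  (2.2.4) 
    $-\infty < \theta < \infty$ 
    $f(x) = \frac{1}{2}e^{-|x-\theta|}$, $-\infty < x < \infty$ 
    $\mu = \theta$, $\sigma^2 = 2$ 
    $m(t) = e^{t\theta}\frac{1}{1-t^2}$, $-1 < t < 1$ 

Logistic  (6.1.8) 
    $-\infty < \theta < \infty$ 
    $f(x) = \frac{\exp\{-(x-\theta)\}}{(1+\exp\{-(x-\theta)\})^2}$, $-\infty < x < \infty$ 
    $\mu = \theta$, $\sigma^2 = \frac{\pi^2}{3}$ 
    $m(t) = e^{t\theta}\Gamma(1-t)\Gamma(1+t)$, $-1 < t < 1$ 

Normal, $N(\mu, \sigma^2)$  (3.4.6) 
    $-\infty < \mu < \infty$, $\sigma > 0$ 
    $f(x) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right\}$, $-\infty < x < \infty$ 
    $\mu = \mu$, $\sigma^2 = \sigma^2$ 
    $m(t) = \exp\{\mu t + (1/2)\sigma^2t^2\}$, $-\infty < t < \infty$ 

t, $t(r)$  (3.6.2) 
    $r > 0$ 
    $f(x) = \frac{\Gamma[(r+1)/2]}{\sqrt{\pi r}\,\Gamma(r/2)}\frac{1}{(1+x^2/r)^{(r+1)/2}}$, $-\infty < x < \infty$ 
    If $r > 1$, $\mu = 0$. If $r > 2$, $\sigma^2 = \frac{r}{r-2}$. 
    The mgf does not exist. 
    The parameter $r$ is called the degrees of freedom. 

Uniform 
    $-\infty < a < b < \infty$ 
    $f(x) = \frac{1}{b-a}$, $a < x < b$ 
    $\mu = \frac{a+b}{2}$, $\sigma^2 = \frac{(b-a)^2}{12}$ 
    $m(t) = \frac{e^{bt}-e^{at}}{(b-a)t}$, $-\infty < t < \infty$ 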
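
Entries for the continuous distributions can be checked by numerical integration. As one illustration, the sketch below compares the mean and variance of an $F(r_1, r_2)$ distribution, obtained with the R function integrate, against the formulas $r_2/(r_2-2)$ and $2\left(\frac{r_2}{r_2-2}\right)^2\frac{r_1+r_2-2}{r_1(r_2-4)}$; the degrees of freedom are arbitrary choices with $r_2 > 4$.

```r
# Numerical check of the F-distribution mean and variance.
# The degrees of freedom are illustrative; the variance requires r2 > 4.
r1 <- 5; r2 <- 10
m1 <- integrate(function(x) x   * df(x, r1, r2), 0, Inf)$value
m2 <- integrate(function(x) x^2 * df(x, r1, r2), 0, Inf)$value
mean_formula <- r2 / (r2 - 2)
var_formula  <- 2 * (r2 / (r2 - 2))^2 * (r1 + r2 - 2) / (r1 * (r2 - 4))
c(mean = m1, formula = mean_formula)
c(var = m2 - m1^2, formula = var_formula)
```

The agreement is only up to the accuracy of the numerical integration.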


Appendix D 


Tables of Distributions 


Prior to the current age of computing, probability tables for certain distributions 
were part of many textbooks in probability and statistics. These are not needed 
any longer. Most statistical computing packages offer easy-to-use calls to determine 
these probabilities and quantiles. This is certainly true of the language R as we have 
discussed throughout this text. Also, many hand calculators have such functions. 
Tables for the following distributions are presented: 


Table I Selected quantiles for chi-square distributions. 


Table II Cumulative distribution function for the standard normal random 
variable. 


Table III Selected quantiles for t-distributions. 


Table IV Selected quantiles for F-distributions. 


Table I 
Chi-Square Distribution 


The following table presents selected quantiles of chi-square distribution, i.e., the 
values x such that 


ia i} 
P p< < = r/2-1 —w/2 
(X <2) | Toya” e€ w, 


for selected degrees of freedom r. The R function chistable.s generates this table. 


                                 P(X ≤ x)
 r    0.010   0.025   0.050   0.100   0.900   0.950   0.975   0.990
 1    0.000   0.001   0.004   0.016   2.706   3.841   5.024   6.635
 2    0.020   0.051   0.103   0.211   4.605   5.991   7.378   9.210
 3    0.115   0.216   0.352   0.584   6.251   7.815   9.348  11.345
 4    0.297   0.484   0.711   1.064   7.779   9.488  11.143  13.277
 5    0.554   0.831   1.145   1.610   9.236  11.070  12.833  15.086
 6    0.872   1.237   1.635   2.204  10.645  12.592  14.449  16.812
 7    1.239   1.690   2.167   2.833  12.017  14.067  16.013  18.475
 8    1.646   2.180   2.733   3.490  13.362  15.507  17.535  20.090
 9    2.088   2.700   3.325   4.168  14.684  16.919  19.023  21.666
10    2.558   3.247   3.940   4.865  15.987  18.307  20.483  23.209
11    3.053   3.816   4.575   5.578  17.275  19.675  21.920  24.725
12    3.571   4.404   5.226   6.304  18.549  21.026  23.337  26.217
13    4.107   5.009   5.892   7.042  19.812  22.362  24.736  27.688
14    4.660   5.629   6.571   7.790  21.064  23.685  26.119  29.141
15    5.229   6.262   7.261   8.547  22.307  24.996  27.488  30.578
16    5.812   6.908   7.962   9.312  23.542  26.296  28.845  32.000
17    6.408   7.564   8.672  10.085  24.769  27.587  30.191  33.409
18    7.015   8.231   9.390  10.865  25.989  28.869  31.526  34.805
19    7.633   8.907  10.117  11.651  27.204  30.144  32.852  36.191
20    8.260   9.591  10.851  12.443  28.412  31.410  34.170  37.566
21    8.897  10.283  11.591  13.240  29.615  32.671  35.479  38.932
22    9.542  10.982  12.338  14.041  30.813  33.924  36.781  40.289
23   10.196  11.689  13.091  14.848  32.007  35.172  38.076  41.638
24   10.856  12.401  13.848  15.659  33.196  36.415  39.364  42.980
25   11.524  13.120  14.611  16.473  34.382  37.652  40.646  44.314
26   12.198  13.844  15.379  17.292  35.563  38.885  41.923  45.642
27   12.879  14.573  16.151  18.114  36.741  40.113  43.195  46.963
28   13.565  15.308  16.928  18.939  37.916  41.337  44.461  48.278
29   14.256  16.047  17.708  19.768  39.087  42.557  45.722  49.588
30   14.953  16.791  18.493  20.599  40.256  43.773  46.979  50.892
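A table of chi-square quantiles of this shape can be regenerated in a few lines of base R. The sketch below is an assumption about how such a table might be produced, not a copy of the text's function chistable.s; the names probs, r, and chisq_table are illustrative.

```r
# Cumulative probabilities used as column headings
probs <- c(0.010, 0.025, 0.050, 0.100, 0.900, 0.950, 0.975, 0.990)
r <- 1:30                                   # degrees of freedom (rows)

# sapply returns one column per value of r, so transpose to get rows by r
chisq_table <- t(sapply(r, function(df) qchisq(probs, df = df)))
rownames(chisq_table) <- r
colnames(chisq_table) <- formatC(probs, format = "f", digits = 3)

print(round(chisq_table, 3))
```

For r = 2 the chi-square distribution is exponential with mean 2, so its quantiles are available in closed form as -2 log(1 - p), which gives a quick hand check on that row.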
Table II
Normal Distribution

The following table presents the standard normal distribution. The probabilities
tabled are

\[ P(Z \le z) = \Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-w^2/2}\, dw. \]

Note that only the probabilities for z ≥ 0 are tabled. To obtain the probabilities
for z < 0, use the identity Φ(−z) = 1 − Φ(z). At the bottom of the table, some
useful quantiles of the standard normal distribution are displayed. The R function
normaltable.s generates this table.


  z    0.00   0.01   0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09
0.0   .5000  .5040  .5080  .5120  .5160  .5199  .5239  .5279  .5319  .5359
0.1   .5398  .5438  .5478  .5517  .5557  .5596  .5636  .5675  .5714  .5753
0.2   .5793  .5832  .5871  .5910  .5948  .5987  .6026  .6064  .6103  .6141
0.3   .6179  .6217  .6255  .6293  .6331  .6368  .6406  .6443  .6480  .6517
0.4   .6554  .6591  .6628  .6664  .6700  .6736  .6772  .6808  .6844  .6879
0.5   .6915  .6950  .6985  .7019  .7054  .7088  .7123  .7157  .7190  .7224
0.6   .7257  .7291  .7324  .7357  .7389  .7422  .7454  .7486  .7517  .7549
0.7   .7580  .7611  .7642  .7673  .7704  .7734  .7764  .7794  .7823  .7852
0.8   .7881  .7910  .7939  .7967  .7995  .8023  .8051  .8078  .8106  .8133
0.9   .8159  .8186  .8212  .8238  .8264  .8289  .8315  .8340  .8365  .8389
1.0   .8413  .8438  .8461  .8485  .8508  .8531  .8554  .8577  .8599  .8621
1.1   .8643  .8665  .8686  .8708  .8729  .8749  .8770  .8790  .8810  .8830
1.2   .8849  .8869  .8888  .8907  .8925  .8944  .8962  .8980  .8997  .9015
1.3   .9032  .9049  .9066  .9082  .9099  .9115  .9131  .9147  .9162  .9177
1.4   .9192  .9207  .9222  .9236  .9251  .9265  .9279  .9292  .9306  .9319
1.5   .9332  .9345  .9357  .9370  .9382  .9394  .9406  .9418  .9429  .9441
1.6   .9452  .9463  .9474  .9484  .9495  .9505  .9515  .9525  .9535  .9545
1.7   .9554  .9564  .9573  .9582  .9591  .9599  .9608  .9616  .9625  .9633
1.8   .9641  .9649  .9656  .9664  .9671  .9678  .9686  .9693  .9699  .9706
1.9   .9713  .9719  .9726  .9732  .9738  .9744  .9750  .9756  .9761  .9767
2.0   .9772  .9778  .9783  .9788  .9793  .9798  .9803  .9808  .9812  .9817
2.1   .9821  .9826  .9830  .9834  .9838  .9842  .9846  .9850  .9854  .9857
2.2   .9861  .9864  .9868  .9871  .9875  .9878  .9881  .9884  .9887  .9890
2.3   .9893  .9896  .9898  .9901  .9904  .9906  .9909  .9911  .9913  .9916
2.4   .9918  .9920  .9922  .9925  .9927  .9929  .9931  .9932  .9934  .9936
2.5   .9938  .9940  .9941  .9943  .9945  .9946  .9948  .9949  .9951  .9952
2.6   .9953  .9955  .9956  .9957  .9959  .9960  .9961  .9962  .9963  .9964
2.7   .9965  .9966  .9967  .9968  .9969  .9970  .9971  .9972  .9973  .9974
2.8   .9974  .9975  .9976  .9977  .9977  .9978  .9979  .9979  .9980  .9981
2.9   .9981  .9982  .9982  .9983  .9984  .9984  .9985  .9985  .9986  .9986
3.0   .9987  .9987  .9987  .9988  .9988  .9989  .9989  .9989  .9990  .9990
3.1   .9990  .9991  .9991  .9991  .9992  .9992  .9992  .9992  .9993  .9993
3.2   .9993  .9993  .9994  .9994  .9994  .9994  .9994  .9995  .9995  .9995
3.3   .9995  .9995  .9995  .9996  .9996  .9996  .9996  .9996  .9996  .9997
3.4   .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9998
3.5   .9998  .9998  .9998  .9998  .9998  .9998  .9998  .9998  .9998  .9998

α         0.400  0.300  0.200  0.100  0.050  0.025  0.020  0.010  0.005  0.001
z_α       0.253  0.524  0.842  1.282  1.645  1.960  2.054  2.326  2.576  3.090
z_{α/2}   0.842  1.036  1.282  1.645  1.960  2.241  2.326  2.576  2.807  3.291
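The symmetry identity for negative arguments and the quantile rows at the bottom of the table can both be checked with pnorm and qnorm. This is a minimal sketch; the value z = 1.25 is an arbitrary choice, and the variable names are illustrative rather than taken from the text.

```r
# Phi(-z) computed directly and through the identity Phi(-z) = 1 - Phi(z)
z <- 1.25
lower        <- pnorm(-z)
via_identity <- 1 - pnorm(z)

# Upper-tail quantiles: z_alpha satisfies P(Z > z_alpha) = alpha
alpha <- c(0.400, 0.300, 0.200, 0.100, 0.050, 0.025, 0.020, 0.010, 0.005, 0.001)
z_alpha  <- qnorm(1 - alpha)
z_alpha2 <- qnorm(1 - alpha / 2)

print(c(lower, via_identity))
print(round(rbind(alpha, z_alpha, z_alpha2), 3))
```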
Table III
t-Distribution

The following table presents selected quantiles of the t-distribution, i.e., the values
t such that

\[ P(T \le t) = \int_{-\infty}^{t} \frac{\Gamma[(r+1)/2]}{\sqrt{\pi r}\,\Gamma(r/2)\,(1 + w^2/r)^{(r+1)/2}}\, dw, \]

for selected degrees of freedom r. The last row gives the standard normal quantiles.


                          P(T ≤ t)
 r     0.900    0.950    0.975    0.990    0.995    0.999
 1     3.078    6.314   12.706   31.821   63.657  318.309
 2     1.886    2.920    4.303    6.965    9.925   22.327
 3     1.638    2.353    3.182    4.541    5.841   10.215
 4     1.533    2.132    2.776    3.747    4.604    7.173
 5     1.476    2.015    2.571    3.365    4.032    5.893
 6     1.440    1.943    2.447    3.143    3.707    5.208
 7     1.415    1.895    2.365    2.998    3.499    4.785
 8     1.397    1.860    2.306    2.896    3.355    4.501
 9     1.383    1.833    2.262    2.821    3.250    4.297
10     1.372    1.812    2.228    2.764    3.169    4.144
11     1.363    1.796    2.201    2.718    3.106    4.025
12     1.356    1.782    2.179    2.681    3.055    3.930
13     1.350    1.771    2.160    2.650    3.012    3.852
14     1.345    1.761    2.145    2.624    2.977    3.787
15     1.341    1.753    2.131    2.602    2.947    3.733
16     1.337    1.746    2.120    2.583    2.921    3.686
17     1.333    1.740    2.110    2.567    2.898    3.646
18     1.330    1.734    2.101    2.552    2.878    3.610
19     1.328    1.729    2.093    2.539    2.861    3.579
20     1.325    1.725    2.086    2.528    2.845    3.552
21     1.323    1.721    2.080    2.518    2.831    3.527
22     1.321    1.717    2.074    2.508    2.819    3.505
23     1.319    1.714    2.069    2.500    2.807    3.485
24     1.318    1.711    2.064    2.492    2.797    3.467
25     1.316    1.708    2.060    2.485    2.787    3.450
26     1.315    1.706    2.056    2.479    2.779    3.435
27     1.314    1.703    2.052    2.473    2.771    3.421
28     1.313    1.701    2.048    2.467    2.763    3.408
29     1.311    1.699    2.045    2.462    2.756    3.396
30     1.310    1.697    2.042    2.457    2.750    3.385
 ∞     1.282    1.645    1.960    2.326    2.576    3.090
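The same layout can be produced with qt. The sketch below is an illustration under the assumption that base R's qt is used with df = Inf for the last row, which R documents as allowed; the names probs, r, and t_table are hypothetical.

```r
probs <- c(0.900, 0.950, 0.975, 0.990, 0.995, 0.999)
r <- c(1:30, Inf)                 # Inf gives the limiting standard normal row

# One row per degrees-of-freedom value, one column per probability
t_table <- t(sapply(r, function(df) qt(probs, df = df)))
rownames(t_table) <- r
colnames(t_table) <- formatC(probs, format = "f", digits = 3)

print(round(t_table, 3))
```

Reading down any column shows the t quantiles decreasing toward the normal quantile in the last row as r grows.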
Table IV
F-Distribution
Upper 0.05 Critical Points

The following table presents selected 0.95 and 0.99 quantiles of the F-distribution,
i.e., for α = 0.05, 0.01, the values F_α(r1, r2) such that

\[ \alpha = P(X \ge F_\alpha(r_1, r_2)) = \int_{F_\alpha(r_1, r_2)}^{\infty} \frac{\Gamma[(r_1+r_2)/2]\,(r_1/r_2)^{r_1/2}\, w^{r_1/2-1}}{\Gamma(r_1/2)\,\Gamma(r_2/2)\,(1 + r_1 w/r_2)^{(r_1+r_2)/2}}\, dw, \]

where r1 and r2 are the numerator and denominator degrees of freedom, respectively.
The R function fp1.r generates this table.


F_{0.05}(r1, r2)
                                      r1
 r2        1       2       3       4       5       6       7       8       9
  1   161.45  199.50  215.71  224.58  230.16  233.99  236.77  238.88  240.54
  2    18.51   19.00   19.16   19.25   19.30   19.33   19.35   19.37   19.38
  3    10.13    9.55    9.28    9.12    9.01    8.94    8.89    8.85    8.81
  4     7.71    6.94    6.59    6.39    6.26    6.16    6.09    6.04    6.00
  5     6.61    5.79    5.41    5.19    5.05    4.95    4.88    4.82    4.77
  6     5.99    5.14    4.76    4.53    4.39    4.28    4.21    4.15    4.10
  7     5.59    4.74    4.35    4.12    3.97    3.87    3.79    3.73    3.68
  8     5.32    4.46    4.07    3.84    3.69    3.58    3.50    3.44    3.39
  9     5.12    4.26    3.86    3.63    3.48    3.37    3.29    3.23    3.18
 10     4.96    4.10    3.71    3.48    3.33    3.22    3.14    3.07    3.02
 11     4.84    3.98    3.59    3.36    3.20    3.09    3.01    2.95    2.90
 12     4.75    3.89    3.49    3.26    3.11    3.00    2.91    2.85    2.80
 13     4.67    3.81    3.41    3.18    3.03    2.92    2.83    2.77    2.71
 14     4.60    3.74    3.34    3.11    2.96    2.85    2.76    2.70    2.65
 15     4.54    3.68    3.29    3.06    2.90    2.79    2.71    2.64    2.59
 16     4.49    3.63    3.24    3.01    2.85    2.74    2.66    2.59    2.54
 17     4.45    3.59    3.20    2.96    2.81    2.70    2.61    2.55    2.49
 18     4.41    3.55    3.16    2.93    2.77    2.66    2.58    2.51    2.46
 19     4.38    3.52    3.13    2.90    2.74    2.63    2.54    2.48    2.42
 20     4.35    3.49    3.10    2.87    2.71    2.60    2.51    2.45    2.39
 21     4.32    3.47    3.07    2.84    2.68    2.57    2.49    2.42    2.37
 22     4.30    3.44    3.05    2.82    2.66    2.55    2.46    2.40    2.34
 23     4.28    3.42    3.03    2.80    2.64    2.53    2.44    2.37    2.32
 24     4.26    3.40    3.01    2.78    2.62    2.51    2.42    2.36    2.30
 25     4.24    3.39    2.99    2.76    2.60    2.49    2.40    2.34    2.28
 26     4.23    3.37    2.98    2.74    2.59    2.47    2.39    2.32    2.27
 27     4.21    3.35    2.96    2.73    2.57    2.46    2.37    2.31    2.25
 28     4.20    3.34    2.95    2.71    2.56    2.45    2.36    2.29    2.24
 29     4.18    3.33    2.93    2.70    2.55    2.43    2.35    2.28    2.22
 30     4.17    3.32    2.92    2.69    2.53    2.42    2.33    2.27    2.21
 40     4.08    3.23    2.84    2.61    2.45    2.34    2.25    2.18    2.12
 60     4.00    3.15    2.76    2.53    2.37    2.25    2.17    2.10    2.04
120     3.92    3.07    2.68    2.45    2.29    2.18    2.09    2.02    1.96
  ∞     3.84    3.00    2.60    2.37    2.21    2.10    2.01    1.94    1.88
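Upper critical points of the F-distribution are quantiles of order 1 − α, so a grid like the one above can be computed with qf and outer. This is a sketch under that assumption, not the text's fp1.r; the names r1, r2, and f05 are illustrative, and df2 = Inf relies on R's documented support for infinite degrees of freedom.

```r
r1 <- 1:9                              # numerator degrees of freedom (columns)
r2 <- c(1:30, 40, 60, 120, Inf)        # denominator degrees of freedom (rows)

# outer() expands both vectors; qf is vectorized over df1 and df2
f05 <- outer(r2, r1, function(d2, d1) qf(0.95, df1 = d1, df2 = d2))
rownames(f05) <- r2
colnames(f05) <- r1

# The same critical point expressed directly as an upper-tail quantile
f_upper <- qf(0.05, df1 = 3, df2 = 12, lower.tail = FALSE)

print(round(f05, 2))
```

Replacing 0.95 with 0.99 in the call to qf gives the upper 0.01 critical points in the same layout.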
Table IV
F-Distribution, Continued
Upper 0.05 Critical Points

Generated by the R function fp2.r.

F_{0.05}(r1, r2)
                                      r1
 r2       10      15      20      25      30      40      60     120       ∞
  1   241.88  245.95  248.01  249.26  250.10  251.14  252.20  253.25  254.31
  2    19.40   19.43   19.45   19.46   19.46   19.47   19.48   19.49   19.50
  3     8.79    8.70    8.66    8.63    8.62    8.59    8.57    8.55    8.53
  4     5.96    5.86    5.80    5.77    5.75    5.72    5.69    5.66    5.63
  5     4.74    4.62    4.56    4.52    4.50    4.46    4.43    4.40    4.36
  6     4.06    3.94    3.87    3.83    3.81    3.77    3.74    3.70    3.67
  7     3.64    3.51    3.44    3.40    3.38    3.34    3.30    3.27    3.23
  8     3.35    3.22    3.15    3.11    3.08    3.04    3.01    2.97    2.93
  9     3.14    3.01    2.94    2.89    2.86    2.83    2.79    2.75    2.71
 10     2.98    2.85    2.77    2.73    2.70    2.66    2.62    2.58    2.54
 11     2.85    2.72    2.65    2.60    2.57    2.53    2.49    2.45    2.40
 12     2.75    2.62    2.54    2.50    2.47    2.43    2.38    2.34    2.30
 13     2.67    2.53    2.46    2.41    2.38    2.34    2.30    2.25    2.21
 14     2.60    2.46    2.39    2.34    2.31    2.27    2.22    2.18    2.13
 15     2.54    2.40    2.33    2.28    2.25    2.20    2.16    2.11    2.07
 16     2.49    2.35    2.28    2.23    2.19    2.15    2.11    2.06    2.01
 17     2.45    2.31    2.23    2.18    2.15    2.10    2.06    2.01    1.96
 18     2.41    2.27    2.19    2.14    2.11    2.06    2.02    1.97    1.92
 19     2.38    2.23    2.16    2.11    2.07    2.03    1.98    1.93    1.88
 20     2.35    2.20    2.12    2.07    2.04    1.99    1.95    1.90    1.84
 21     2.32    2.18    2.10    2.05    2.01    1.96    1.92    1.87    1.81
 22     2.30    2.15    2.07    2.02    1.98    1.94    1.89    1.84    1.78
 23     2.27    2.13    2.05    2.00    1.96    1.91    1.86    1.81    1.76
 24     2.25    2.11    2.03    1.97    1.94    1.89    1.84    1.79    1.73
 25     2.24    2.09    2.01    1.96    1.92    1.87    1.82    1.77    1.71
 26     2.22    2.07    1.99    1.94    1.90    1.85    1.80    1.75    1.69
 27     2.20    2.06    1.97    1.92    1.88    1.84    1.79    1.73    1.67
 28     2.19    2.04    1.96    1.91    1.87    1.82    1.77    1.71    1.65
 29     2.18    2.03    1.94    1.89    1.85    1.81    1.75    1.70    1.64
 30     2.16    2.01    1.93    1.88    1.84    1.79    1.74    1.68    1.62
 40     2.08    1.92    1.84    1.78    1.74    1.69    1.64    1.58    1.51
 60     1.99    1.84    1.75    1.69    1.65    1.59    1.53    1.47    1.39
120     1.91    1.75    1.66    1.60    1.55    1.50    1.43    1.35    1.25
  ∞     1.83    1.67    1.57    1.51    1.46    1.39    1.32    1.22    1.00
Table IV
F-Distribution, Continued
Upper 0.01 Critical Points

The R function fp3.r generates this table.

F_{0.01}(r1, r2)
                              r1
 r2        1       2       3       6       7       8       9
  1   4052.2  4999.5  5403.4  5859.0  5928.4  5981.1  6022.5
  2    98.50   99.00   99.17   99.33   99.36   99.37   99.39
  3    34.12   30.82   29.46   27.91   27.67   27.49   27.35
  4    21.20   18.00   16.69   15.21   14.98   14.80   14.66
  5    16.26   13.27   12.06   10.67   10.46   10.29   10.16
  6    13.75   10.92    9.78    8.47    8.26    8.10    7.98
  7    12.25    9.55    8.45    7.19    6.99    6.84    6.72
  8    11.26    8.65    7.59    6.37    6.18    6.03    5.91
  9    10.56    8.02    6.99    5.80    5.61    5.47    5.35
 10    10.04    7.56    6.55    5.39    5.20    5.06    4.94
 11     9.65    7.21    6.22    5.07    4.89    4.74    4.63
 12     9.33    6.93    5.95    4.82    4.64    4.50    4.39
 13     9.07    6.70    5.74    4.62    4.44    4.30    4.19
 14     8.86    6.51    5.56    4.46    4.28    4.14    4.03
 15     8.68    6.36    5.42    4.32    4.14    4.00    3.89
 16     8.53    6.23    5.29    4.20    4.03    3.89    3.78
 17     8.40    6.11    5.18    4.10    3.93    3.79    3.68
 18     8.29    6.01    5.09    4.01    3.84    3.71    3.60
 19     8.18    5.93    5.01    3.94    3.77    3.63    3.52
 20     8.10    5.85    4.94    3.87    3.70    3.56    3.46
 21     8.02    5.78    4.87    3.81    3.64    3.51    3.40
 22     7.95    5.72    4.82    3.76    3.59    3.45    3.35
 23     7.88    5.66    4.76    3.71    3.54    3.41    3.30
 24     7.82    5.61    4.72    3.67    3.50    3.36    3.26
 25     7.77    5.57    4.68    3.63    3.46    3.32    3.22
 26     7.72    5.53    4.64    3.59    3.42    3.29    3.18
 27     7.68    5.49    4.60    3.56    3.39    3.26    3.15
 28     7.64    5.45    4.57    3.53    3.36    3.23    3.12
 29     7.60    5.42    4.54    3.50    3.33    3.20    3.09
 30     7.56    5.39    4.51    3.47    3.30    3.17    3.07
 40     7.31    5.18    4.31    3.29    3.12    2.99    2.89
 60     7.08    4.98    4.13    3.12    2.95    2.82    2.72
120     6.85    4.79    3.95    2.96    2.79    2.66    2.56
  ∞     6.63    4.61    3.78    2.80    2.64    2.51    2.41
Table IV
F-Distribution, Continued
Upper 0.01 Critical Points

The R function fp4.r generates this table.

F_{0.01}(r1, r2)
                                      r1
 r2       10      15      20      25      30      40      60     120       ∞
  1   6055.9  6157.3  6208.7  6239.8  6260.7  6286.8  6313.0  6339.4  6365.9
  2    99.40   99.43   99.45   99.46   99.47   99.47   99.48   99.49   99.50
  3    27.23   26.87   26.69   26.58   26.50   26.41   26.32   26.22   26.13
  4    14.55   14.20   14.02   13.91   13.84   13.75   13.65   13.56   13.46
  5    10.05    9.72    9.55    9.45    9.38    9.29    9.20    9.11    9.02
  6     7.87    7.56    7.40    7.30    7.23    7.14    7.06    6.97    6.88
  7     6.62    6.31    6.16    6.06    5.99    5.91    5.82    5.74    5.65
  8     5.81    5.52    5.36    5.26    5.20    5.12    5.03    4.95    4.86
  9     5.26    4.96    4.81    4.71    4.65    4.57    4.48    4.40    4.31
 10     4.85    4.56    4.41    4.31    4.25    4.17    4.08    4.00    3.91
 11     4.54    4.25    4.10    4.01    3.94    3.86    3.78    3.69    3.60
 12     4.30    4.01    3.86    3.76    3.70    3.62    3.54    3.45    3.36
 13     4.10    3.82    3.66    3.57    3.51    3.43    3.34    3.25    3.17
 14     3.94    3.66    3.51    3.41    3.35    3.27    3.18    3.09    3.00
 15     3.80    3.52    3.37    3.28    3.21    3.13    3.05    2.96    2.87
 16     3.69    3.41    3.26    3.16    3.10    3.02    2.93    2.84    2.75
 17     3.59    3.31    3.16    3.07    3.00    2.92    2.83    2.75    2.65
 18     3.51    3.23    3.08    2.98    2.92    2.84    2.75    2.66    2.57
 19     3.43    3.15    3.00    2.91    2.84    2.76    2.67    2.58    2.49
 20     3.37    3.09    2.94    2.84    2.78    2.69    2.61    2.52    2.42
 21     3.31    3.03    2.88    2.79    2.72    2.64    2.55    2.46    2.36
 22     3.26    2.98    2.83    2.73    2.67    2.58    2.50    2.40    2.31
 23     3.21    2.93    2.78    2.69    2.62    2.54    2.45    2.35    2.26
 24     3.17    2.89    2.74    2.64    2.58    2.49    2.40    2.31    2.21
 25     3.13    2.85    2.70    2.60    2.54    2.45    2.36    2.27    2.17
 26     3.09    2.81    2.66    2.57    2.50    2.42    2.33    2.23    2.13
 27     3.06    2.78    2.63    2.54    2.47    2.38    2.29    2.20    2.10
 28     3.03    2.75    2.60    2.51    2.44    2.35    2.26    2.17    2.06
 29     3.00    2.73    2.57    2.48    2.41    2.33    2.23    2.14    2.03
 30     2.98    2.70    2.55    2.45    2.39    2.30    2.21    2.11    2.01
 40     2.80    2.52    2.37    2.27    2.20    2.11    2.02    1.92    1.80
 60     2.63    2.35    2.20    2.10    2.03    1.94    1.84    1.73    1.60
120     2.47    2.19    2.03    1.93    1.86    1.76    1.66    1.53    1.38
  ∞     2.32    2.04    1.88    1.77    1.70    1.59    1.47    1.32    1.00
Appendix F


Answers to Selected 
Exercises 


Chapter 1 1.3.6 }, 
1.2.1 (a) A 1,2,3,4}, {2}; (b) (0,3), 1.3.10 (a)(8)/(8), (b) (P)/(78). 


fer1 Cpe 2}: 


(c) {(z,y):1<a<2,1<y<2). 4.9411 - eye. 
1.2.2 (a) {1 :0<a2< 5/8}. 1.3.13 (b) 1— rey, 
1.2.3 Ci N Co = (mary, mray}. : 

an , 1.3.15 (a) 1— (8)/(*) 
124 (6 (UA) =n40; (A,r SUA, 
1.3.16 n = 23. 
1.2.6 (a) {x: Sass és 

(b) {(a,y) 10 < 22 +y2 <4}. 1.3.19 13. 12(3)(3)/($). 
1.2.7 (a) {x : 2 =2}, (b) ¢ 1.3.22 (a) 0 < 33°, py <1, (b) no. 

(c) {(z,y):2=0, y =O}. 

1.4.3 3. 
1.2.8 (a) 82, (b) 1. 
1.4.4 233 22 26 26 
1.2.9 4,0,1. anes 
LAG, 
1.2.10 5,0, 4. = 
1.4.8 (a) 0.022, (b) 2. 
1.2.11 (a) 4, (b) 0, (c) &. 0) 
1.4.9 2. 
1.2.12 (a) 2, (b) 0. i 
4 
1.2.14 10. 1.4.10 7,7. 
1.3.2 5,5,.5.7- 1.4.12 (c) 0.88. 
1.3.3 . &, a 63 1.4.14 (a) 0.1764. 
1.3.4 0.3. 1.4.15 4(0.7)3(0.3). 
1.3.5 e¢%,1=—e 41, 1.4.16 0.75. 
(a) \sum_{x=1}^{20} 4/[20(25 − (x − 1))]


x 
Download exi1427.R 


13 39 
1.5.5 (a) Cte = 0,1, 2,3,4,5, 


H(O+Ove. 
1.5.7 3. 


1.5.8 For the plot download ex158.R 
(a) 7, (b) 0, (c) 3, (d) 0. 


1.6.4 2,0 = 0; 4,07 =1,2,3,4,5. 


a) Download dex165.R. 


1.6.8 (2) y=1)8, 270: 


x=1:20; sum(4/((25-x+1)*20))




1.7.1 F(z) = ¥2,0< # < 100; 
f(®) = gq 0 <a < 100. 


1.7.6 (a) 37,1; (b) 3, 3. 


1.7.8 (a)1; (b) 4; (c) 2. 


1.7.9 (b) %/1/2; (c) 0. 

1.7.10 Y0.2. 

1.7.12 (a) 1—(1—2)? ,0<2<1; 
(b) 1—4,1<2<oo. 


1.7.13 x e^{-x}, 0 < x < ∞; mode is 1.


1.7.14 5. 


1.7.17 


ble 


1.7.19 —/2. 


1.7.20 (b) f_Y(y) = 1/(5+y)^{1.2}.
(c) dlife <-
function(y) {1/(5+y)^(1.2)}.


1.7.21 (a) f(x) = (5/3)e^{-x}[1+(2/3)e^{-x}]^{-7/2}.


(b) f=function(x)
{(1+(2/3)*exp(-x))^(-5/2)}


1.7.22 3 ,0<y< 27. 

1.7.24 ATA» 0 <y<o. 

1.7.25 cdf l—e ¥,0<y<o. 

1.7.26 pdf wi O<y<l, 
a ley<4. 

1.8.3 2,86.4, —160.8. 

1.8.4 3,11, 27. 


1.8.5 log -a 50.5 . 




1.8.8 $7.80. 

1.8.9 (a) 2; (b) pdfis 3 ,1<y<o; 
(c) 2. 

1.8.10 £ 


1.8.13 P|G = —po] = 3, PIG = 1- 


.. .P[G = 50 — po] = 35(0.0045). 


1.8.14 Range of o {2 — po, 5 — po, 8 — 
po}, Probs: 4, 4, a 

1.9.1 (a) 1.5,0.75; (b) 0 
(c) 2, does not exist 


1.9.2 £.t < log2;2;2. 


1.9.12 10; 0; 2; —30. 
1.9.14 (a) —2x2; (b) 0; (c) 2. 


poser 57 50. 


1.9.16 4; 3:3; 


31, 167 
1.9.18 T2) 744° 


1.9.19 E(X") = Ce) 


1.9.20 odd moments are 0, E(X?") = 


(2n)!. 
3 i8 
1.9.24 3; 3. 
1.9.27 (1 — Bt)—1, B, B?. 
1.10.3 0.84. 


1.10.4 P(|X| > 5) = 0.0067. 
Chapter 2 
ie sr 


15 plea 
64' 93 33 3- 


2.1.2 


Ale 


2.1.7 z e^{-z}, 0 < z < ∞.


2.1.8 −log z, 0 < z < 1.
2.1.9 (2) () (sey) / (is)
x and y nonnegative integers 
such that x+y < 18. 


2 ALT zu —27),0 <2 <1; 
52,0 <a <i. 


2.1.14 


2.1.15 et < log 2. 


2.1.16 (1—te)*(1—t) —te)~?, te < 1, 
ty + te < 1;no. 


1 2 3 4 6 9 


360 386 B86 BGHBGHCS—CsBG” 


2.2.2 


2.2.3 ce 4~? O0< Yi < o@. 
2.2.4 8y1y3,0< yi <1. 


2.2.6 (a) yie %",0 << y1 < 0; 
(b) (1 —#1)7?, #1 <1. 


32142, 6rj+6x141 


2.3.1 G3} 2(6a1+3)2 * 


2.3.2 (a) 2 
(b) 123,0 < Ly <a <1; 
(c) 33; (d) 


1536" 
3a, 305, 
2.3.3 (a) 9 ae} 


(b) pdf is 7(4/3)"y°,0 < y < 3; 

(c) E(X) = E(Y) = 3: 

Var(X1) = aan > Var(Y) = —. 
2.3.8 c+10<4%<@m. 


2.3.9 (a) (2) (23) (say mira) (5) tas 2 
nonnegative integers, 71 + ©2 < 5; 


(c) (2) (se: aa) / (say) 


tq <5-— 2}. 


2.3.11 (a) $,0<ag<a1 <1; 
(b) 1- 2. 


2.3.12 (b) e~ 


) 
2.5.1 (a) 1; (b) —1; (c) 0. 


Ig 
a eg
2.5.9 4. 2.8.13 WET 
5 
2.4.4 37. 2.8.15 0.801. 
D456 2, Chapter 3 
2.4.6 2:2. 40 
3.1.1 2. 
3 
2.4.8 2 0<y<1. 3.1.4 1-pbinom(34,40,7/8)=0.6162. 
2412 4, 3.1.6 5. 
Sat 2, 
2.4.13 4;4. 16 
43y-43z s.1418 2. 
2.6.1 (g) Fret. 81 
1) (2\2-3 | 
2.6.2 (a) 4:0: 3.1.15 (4) (3) 2 =3,4,5,.... 


(b) (I-t1)~* (Ite) "(Ita)" yes. 3.4.46 5. 
: ul 
2.6.3 pdf is 12(1—y)",0<y<l1. 3.1.18 (a) Negative binomial, parame- 


S64 guid Paya? ters r and T/N. 


2.6.6 σ1(ρ12 − ρ13ρ23)/σ2(1 − ρ23²);
σ1(ρ13 − ρ12ρ23)/σ3(1 − ρ23²).
2.6.9 (a) 3.

3.1.19 (b) Code: ps=c(.3,.2,.2,.2,.1)
coll=c()
for(i in 1:10000)
{coll<-c(coll,multitrial(ps))}
table(coll)/10000


2.7.1 joint pdf yoyze"8,0 << y. <1, $1.20 (a) -82.40 


0< y2 < 1,0 < y3 < oo. 


3.1.22 Z. 
27.2 =< y= |, 
2 ? 24 
vu 3.1.23 =e. 
1 =o 
273 TpOSY< ae SY<9 3.4.95 (a) 2; (b) %; (0) 2. 
2.7.7 dyoysys,0<yi <1. aoe = 
6 
2.7.8 (a) ois a5: (b) (F 4+ Fel)’. 3.1.30 (a) 0.0853; (b) 0.2637; (c) 0.0861, 
ste et 0.2639. 
poe eas 3.2.1 0.09. 
BeBe 3.2.4 4%e~4/al,2 =0,1,2.... 
2.8.5 2.5; 0.25. 
3.2.5 0.84., 0.9858. 
2.8.7 —5; 30.6. 
3.2.11 About 6.7. 
O1 
he S909 6. 
2.8.10 0.265. 3.2.14 2. 


2.8.12 °22.5;65.25, 3.2.16 (a) e~? exp{(1 + e”)e®}. 
3.3.1 (a) 0.05.; (b) 0.9592 3.4.31 0.159.

3.3.2 0.831; 12.8. 3.4.33 ¥°(2). 

3.3.3 (6) 0.1355 3.5.1 (a) 0.574; (b) 0.735. 

3.3.4 x?(4). 3.5.2 (a) 0.264; (b) 0.440; (c) 0.433; 
(d) 0.643. 


3.3.6 pdf is 3e7-94,0 < y < ow. 


3.5.7 2. 
3.3.7 2;0.95. 


3.5.8 (38.2, 43.4). 


3.3.14 (a) 0.0839; (b) 0.2424 
7 3.5.17 0.05. 
3.3.15 1. 
3.6.1 0.05. 
3.3.16 \7(2). 3.6.2 1.761. 


3.6.5 (d) 0.0734; (e) 0.0546 
3.6.6 1.732; 0.1817 


op 
3.3.18 255) Teqptietey: 


3.3.19 (a) 20; (b) 1260; (c) 495. 


1. 
3.3.20 30. 3.6.10 777; 3.33. 


3.3.24 (a) (1—6t)~8,t < 3; 3.6.13 (a) f(y) = e¥[1+ (1/s)e¥J]- Oth. 


(b) Tia = 8,8 = 6). 3.7.1 E(X) =(1—8)-%, if B <1. 
3.4.2 0.067; 0.685. 3.7.2 Download dloggamma.R. 
3.4.3 1.645. Chapter 4 
3.4.4 71.4; 189.4. 
4.1.1 (b) 101.15; (c) 55.5; @log2 

3.4.8 0.598. (d) 70.11. 

ci 4.1.2 (b) 201, 293.9, 17.14, 11.72; 
4.1.3 9.5. 

3.4.12 0.90. 

ae 4.1.10 (e) 0.65; 0.95. 

3.4.14 0.461. 4.1.11 (e) 0.92; 0.97 

3.4.15 N(0,1). 4.2.1 (79.21, 83.19), 90%. 

3.4.16 0.433. 4.2.2 (51.82, 150.48) 

3.4.17 053. 4.2.4 (6.46, 24.69). 

3.422 N02): 4.2.5 (0.143, 0.365). 

3.4.28 Mean is \/2/m(a/V1+4+ a7). 4.2.7 (3.7, 5.7). 

3.4.29 0.24. 4.2.8 160. 


3.4.30 0.159. 4.2.9 (a) 1.310; (b) 1.490. 
4.2.10 c= >k = 1.50.


a 

4.2.13 ind=rep(0,numb); 
(olla 1c {1 121 <0) tind [4] <1) 

4.2.14 (37,22). 
4.2.16 6765. 
4.2.17 (3.19,3.61). 
4.2.18 (b) (3.625, 29.101). 
A291 (3:39, 1.72), 
4.2.26 135 or 136. 
4.3.1 (c) (0.1637, 0.3642). 
0.4972, 0.6967). 
c) (0.197, 1.05). 


a) 0.00697; (b) 0.0244; (c) 0.0625 


4.3.3 
4.3.4 
4.4.2 


wees, yhet Aye, «eee 


4.4.5 (a) 4,23, 67, 99, 301. 
4.4.5 1—(1—e73)4. 
4.4.6 (a) 3. 

4.4.10 Weibull. 


4.4.11 3 
4.4.12 pdf: (221)(423)(6z3) 
0<4%<1 
it 
4.4.13 =. 
4.4.17 (a) Suses0 <y3<ys <1; 
(b) = — #0 < ys < ya; (¢ ) Sys. 
4.4.18 } 


4.4.19 6uv(ut+v),0O<u<v<l. 
4.4.24 14. 


4.4.25 (a) +2; (b) SO; 


[024° (c) (0.8)*. 
4.4.26 0.824. 
4.4.27 8. 


4.4.28 (a) 1.130; (b) 0.920. 
4.4.30 (40, 124), 88%.

4.4.32 (180, 190) and (195, 210). 

4.5.3 1— (3)° +0(3)° log (3) ,6 =1,2. 
4.5.4 0.17; 0.78. 

4.5.8 n = 19 or 20. 

4.5.9 7 (5) = 0.062; 7 (4) = 0.920. 
4.5.10 n ≈ 73; c ≈ 42.

4.5.12 (a) 0.051; (b) 0.256; 0.547; 0.780. 


4.5.13 (a) 0.154; (b) 0.154. 

4.5.14 (1) 0.11514; (2) 0.0633. 

4.6.5 (b) t = —3.0442, p-value = 0.0033. 
4.6.6 (b) t = 2.034, p-value = 0.06065. 


4.7.1 p-value = 0.0184. 

4.7.2 8.37 > 7.81; reject. 

4.7.4 b< 8 orb> 32. 

4.7.5 2.44 < 11.3; do not reject H0.
4.7.6 6.40 < 9.49; do not reject H0.
4.7.7 χ² = 49.731, p-value = 1.573e-09.
4.7.8 k=3. 


4.8.5 F^{-1}(u) = log[u/(1 − u)].
4.8.7 For 0 < u < 1/2:
F^{-1}(u) = log[2u].
For 1/2 < u < 1:
F^{-1}(u) = −log[2(1 − u)].
4.8.8 F^{-1}(u) = log[−log(1 − u)].


4.8.18 (a) F^{-1}(u) = u^{1/β},
(b) e.g., dominated by a
uniform pdf.
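An inverse cdf such as the one in 4.8.5 can be used to generate observations by applying it to uniform(0,1) draws. The sketch below is illustrative only: the seed and sample size are arbitrary, and the variable names are not from the text. It uses the fact that log[u/(1 − u)] is the quantile function of the standard logistic cdf F(x) = 1/(1 + e^{-x}), which R provides as plogis.

```r
set.seed(1)                 # arbitrary seed, for reproducibility only
u <- runif(10000)           # uniform(0,1) draws
x <- log(u / (1 - u))       # inverse cdf from 4.8.5 applied to u

# Applying the logistic cdf to x should recover u up to rounding error
max_err <- max(abs(plogis(x) - u))
print(max_err)
```

The same pattern applies to the other inverses listed above: replace the line defining x with the appropriate expression in u.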

4.9.4 (a) Glog2. 

4.9.8 Use sz = 20.41; s, = 18.59. 


4.9.10 (a) 7 —% = 9.67; 
20 possible permutations; 
(c) Pr/n”. 
4.9.11 U0;


4.10.1 8. 


4.10.4 (a) Beta(n — j +1, 4); 
(b) Beta(n —j +i1-—1,j -74 


n* Vai (es = a) 


2). 


4.10.5 103 (1 — vu, — v2), 
0< V2, U1 + V2 < ilk; 


Chapter 5 


5.1.9 No;Y, — +. 


5.2.1 Degenerate at p. 
5.2.2 Gamma(α = 1, β = 1).
5.2.3 Gamma(α = 1, β = 1).
5.2.4 Gamma(α = 2, β = 1).
5.2.7 Degenerate at β.
5.2.9 0.682.
pchisq(60,50) - pchisq(40,50) = .686

5.2.10 Download function cdistp1t4. 


5.2.11 (a) 1-pbinom(55,60, .95)=0.820 
(b) 0.815. 


5.2.14 Degenerate at jg + 3 (x — p11). 
5.2.15 (b) N(0,1). 

5.2.17 (b) N(0,1). 

5.2.20 < 


5.3.2 0.954. 
5.3.3 0.604. 
5.3.4 0.840. 
5.3.5 0.728. 
5.3.7 0.08. 
5.3.9 0.267. 


Chapter 6 
6.1.1 (a) 6=X/4. (c) 5.03 


6.1.2 (a) —n/log([]j_, Xi). 
(b)¥%4 — min{X1, aan Xn}. 
6.1.4 (a) Y, = max{X,...
(b) (2n + 1)/(2n). 
(c) \/1/2¥n. 


6.1.5 (a) X = 0U/?, U is unf(0, 1). 
(b) 7.7, 5.4. 


Xn}. 


6.1.6 1 — exp{—2/X}. 


pup) 
—0.534. 


® 0.3597. 
6.1.8 (b) 
6.1.9 Ze—*/2., 0.2699. 


6.1.10 max {5,X}. 


= 3.547. 
.39, 4.92), Yes. 


(a) F(x) =1—- [6?/(x@ + 6)°]. 

) g=function(n,t) {u=runif (n) 

*((1-u) 7 (-1/3)-1)} 

6.3.1 (b) Test-Stat = 17.28, Reject 

6.3.2 (8) = Ply2(2n) < (80/8)e1] 
+P[x2(2n) > (60/8)cal. 

6.3.8 Reject if 250", Vi < Xt_a/e(2n) 


or 
2a Vi > x4/2(27). 


yn" 


6.3.17 (a) x2, = {\/nI(X)(X — O)}?. 
(b) Download waldpois.R. 
(c) xz = 6.90, p — value = 0.0172. 


z/a ve 
6.3.18 (2) 


x exp {- are (4 


6.3.16 (a) (+)"” (a2 
6.4.1 (a) 0.300, 0.225, 0.350, 0.125.
(b) CI for po: (0.167, 0.283) 
6.4.2 (a) 7,9, 
Ta [ora (wi C zm)? a 3 
b) Sau, 
ei iy +e 01)? | 7.1.5 d1(y). 
(n +m)? 


7.1.6 b = 0, does not exist. 


6.4.3 6, = min{X;}, = doar (Xi — 81). 7.1.7 does not exist. 


6.4.4 6; = min{ Xj}, 72.8 (0, i=). 

n/ log es x; /8?] : 7.2.9 (a) ne er bay yit(n—r)yr] 
6.4.5 (Yi + Yn)/2, (Yn — Yi)/2;no. (b) ro" [aa ve + (2 — yr. 
6.4.6 (a) X + 1.282,/2=!9; 7.3.2 60y3 (ys — ys)/0?; 

ny 0<y3 <Y5 < 0; 
(b) ® c—X . 6 ys /5; 02/7; 07/35. 
(n-1)/nS 
7.3.3 gre 1/90 < yo < yi < Ow; 
6.4.7 (a) mle is 0.7263, p = 0.76 yi /2; 02/2 
A run of BS: (0.629, 0.828). 
Via p : (0.642, 0.878). 1308 SON aS 
: _ (n+ 1)Y,/n. 
6.4.8 (a) mle is 64.83, x45) = 64.6 
6.4.9 If 4 < %, then p; = # and 7.3.6 6X. 
Po = @; else, pi = po = *. 7.4.2 (a) X; (b) X 
6.5.1 ¢ = —8.64, p — value = 0.0001. 7.4.3 Y/n. 
6.5.2 (81.30004, 81.30156). 7.4.5 Y,— 4. 
6.5.3 (b) (—0.0249, 0.1749). 7.4.7 (a) Yes; (b) yes. 
n 2 
6.5.6 (b) CREE. 7.4.8 (a) E(X) =0. 
a 7.4.9 (a) max{—Yj,0.5Y;,,}; (b) yes; 
— xX ’ ’ ’ 
6.5.7 F= >: (c) yes. 
6.5.8 (b) F = 2/7 = 0.3389, Reject. 7.5.1Y¥,= oy Xi; Y1/4n; yes. 
[max{—X1,X,,, }]"![max{—¥i,Yn, }]"2 _ 
6.5.9 C—Traxl XL YOOX. Yas} aes 7.5.4 T/a. 
x"(2). 7.5.9 T. 


6.6.8 The R function mixnormal, at site 75.44 (b) Y:/n; (c) 6; (d) Yi /n. 


listed in the Preface 
produced these results: 


—2 1 
(first row are initial estimates, sec- 7.6.1 X — a 


ond row are the estimates after 500 ito 
iterations): 7.6.2 Y°/(n* + 2n). 
7.6.3 (a) 0.8413; (b) 0.7702 (c) Our run


0.0584 


7.6.4 (a) 49.4; (b) Our run: 4.405 


(+s): 
0488) 


7.6.6 (a) (4+ 
(Db) 


(c) N (0 


Y 
nX 


3s|s 
—" 


7.6.9 l—e 


7.6.10 (b) X; (c) X; (d) 1/X. 
7.7.3 Yes. 


7.7.5 (a) HOA /ats. 


(b) Download bootse6.R 
10.1837; Our run: 1.156828 


7.7.6 (b) oe, GO 


7.7.7 (a) K = (P((n — 1)/2)/T(n/2)) 


x /((n — 1)/2) 
mvue = ®-!(p)KS +7 
(c) 59.727; Our run 3.291479. 


(a) st Whar (Xin — Xi) 
x (Xjn — X35); 
(b) Soja Xi. 


17.10 (Sie eye) 


7.8.3 ¥i5j (= Yi)". 


7.7.9 


7.9.13 (a) T'(3n, 1/6), no; 

(c) (8n — 1)/Y; 

(ec) Beta(3, 3n — 3). 
Chapter 8 
8.1.4 ae > 18.3; yes; yes. 
8.15 [Eee 
8.1.6 337, 2 +27, Gi Se. 
8.1.7 About 96; 76.7. 
8.1.8 JJ}, [2:(1 — 2i)] >. 
8.1.9 About 39; 15.
8.1.10 0.08; 0.875. 
8.2.1 (1 —)9(1 + 96). 


8.2.2 1— h,1 <8. 


8.2.3 1— 6 (45%). 
8.2.4 About 54; 5.6. 
8.2.7 Reject Ho if F > 77.564. 
8.2.8 About 27; reject Ho if F < 24. 
8.2.10 I'(n, 6); 

Reject Ho if >, vi >. 


8.2.12 (b)$; (c) 4. 


(d) reject if y = 0; 


if y = 1, reject with probability z. 


8.3.1 (b) t = —2.2854, p = 0.02393; 
(c) (—0.5396 — 0.0388). 


8.3.5 (d) n= 90. 
8.3.6 78; 0.7608. 


8.3.10 Under Ay, (04/03) F has 
an F'(n — 1,m — 1) distribution. 


8.3.12 Reject Ho if |y3 — 90| > ¢. 
8.3.14 (a) []J",(1—ai) >. 
8.3.17 (b) F = 1.34; p = 0.088. 
8.4.1 5.84n — 32.42; 5.84n + 41.62. 
8.4.2 0.04n — 1.66; 0.04n + 1.20. 
8.4.4 0.025, 29.7, —29.7. 
8.5.5 (9y — 20x) /30 < c > (a, 
8.5.7 2w? + 8we > c => (wi, we) € I. 
Chapter 9 

9.2.3 6.39. 

9.2.6 (b) F = 1.1433, p = 0.3451. 
9.2.7 7.875 > 4.26; reject Ho. 

9.2.8 10.224 > 4.26; reject Ho. 


y) € 2nd. 


730 


9.3.2 2r +40. 


9.3.3 (a) 5m/3; (b) 0.6174; 0.9421; 
(c) 7 


9.4.1 None. For B — C: (—0.199, 10.252). 


9.4.2 No significant differences. 


9.4.3 (a) Cl’s of form: (4.2.14) using 
a/k. 


9.4.4 (a)(—0.103, 0.0214) 
(b) x2 = 24.4309, p = 0.00367, 
(—0.103, 0.021). 


9.5.6 7.00; 9.98. 
9.5.8 4.79; 22.82: 30.73. 


9.5.10 (a) 7.624 > 4.46, reject Ha; 
(b) 15.538 > 3.84, reject Hp. 


9.5.11 8;0;0;0;0; —3; 1; 2; —2; 
2; —2: 2:2: —2: 2° 2:0: 0; 0; 0. 


9.6.1 N(a*,o2(n-1+2?/ \(a; —F)?)). 


9.6.2 (a) 6.478+4.4832; (d) (—0.026, 8.992). 


9.6.3 (a) —983.8868 + 0.50412. 


9.6.8 PI: (3.27, 3.70) 


9.6.10 B=n 0, ¥i/ai; 
ve’ SA Vela) 


9.6.14 a = 3. 
9.7.2 Reject Ho. 


9.7.6 Lower Bound: tanh E log +* 


9.7.7 (a) 0.710, (0.555, 0.818); 
(b) Pitchers: 0.536, (0.187, 0.764). 


9.8.2 2; pw’ Aps; ur = po = 0. 


9.8.3 (b) A? = A;tr(A) = 2; 
bw’ Ap/8 = 6. 


9.8.4 (a) \o0?/n?. 
9.8.5 (a) [1+ (n—1)p](o7/n). 
9.9.1 Dependent. 


<n" Y (¥5/29)). 


Za/2 
n—3 
9.9.3 0,0,0,0.

9.9.4 7", aij =0. 
Chapter 10 

10.2.3 (a) 0.1148; (b) 0.7836. 


10.2.4 (a) 425; (380, 500): 
(b) 591.18; (508.96, 673.41). 


10.2.9 (a) P(Z > 2 — (a/Vn)6), 
where £(Z) = 0 and Var(Z) = 1; 
(c) Use the Central Limit Theo- 
rem; 


(a) [Ss ° 


V Ai A2(d/e)]. 
10.4.3 Conf. Int. for MWW: (0.0483, 0.0571).


10.4.2 1 — ®[z, — 


10.4.4 Our run: n, = ng = 39 yielded 
0.8025 power. 


10.3.4 (a) T* = 174, p-value = 0.0083. 
(b) t = 3.0442, p-value = 0.0067. 


n(n—1) 


10.5.3 2). 


10.7.1 (b) (0.156, 0.695). 


10.5.14 (a) Wi =9;W, =6; (b) 1 


(c) 9.5. 


10.8.3 YLg = 205.9 + 0.0152; 
Yw = 211.0+ 0.0102. 


10.8.4 (a) Jj,g = 265.7—0.765(a—1900); 
_ Jw = 246.9 — 0.436(a — 1900); 
(b) Gg = 3501.0-38.35(a—1900); 
Qw = 3297.0 — 35.52(a — 1900). 


2; 


10.8.9 rgc = 16/17 = 0.941 
(zeroes were excluded). 


10.8.10 ry = 0.835; z = 3.734. 
10.9.4 Cases: t< yandt> y. 
10.9.5 (c) y? —o?. 


10.9.7 aja" se —¥); 
(c) y? — 0. 
10.9.9 0; [4f2(0)]~?


10.9.14 {p95 = 3.14 + .0282; 
Yw = 0.214 + .0202. 


Chapter 11 

11.1.1 0.45;0.55, 

11.1.3 [yr? + po? /n]/(7? + 07 /n). 
11.1.4 B(yt+a)/(n6 +1). 


— yitor arog 
11.1.6 nta,+a2+a3? n+a,+a2+a3° 


a 2 
11,4,8 (a) (9 — +0400) 
(%) 300(1 — 8). 
11.1.9 V2, y4 < 1; V2y4,1< ys. 


62 
[05+(#1— NBIC +(a2—61)7]° 


10,54. (a) 


11.2.3 (a) 76.84; (b) (76.25, 77.43). 
)

11.2.5 (a) 1(0) = 0-7; (d) x7 (2n). 
) beta(n® + 1,n+1— nz). 
) 


11.3.1 (a) Let U1 and U2 be iid
uniform(0,1):


1. Draw Y = − log(1 − U1)
2. Draw X = Y − log(1 − U2).
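The two steps above can be sketched in R as follows. The seed, the number of draws n, and the variable names are assumptions made for illustration; only the two transformations come from the answer itself.

```r
set.seed(2)                 # arbitrary seed, for reproducibility only
n  <- 5000                  # number of (X, Y) pairs to generate
u1 <- runif(n)
u2 <- runif(n)

y <- -log(1 - u1)           # step 1: draw Y
x <- y - log(1 - u2)        # step 2: draw X given Y

print(head(cbind(x, y)))
```

Because −log(1 − U2) is positive, every generated pair satisfies x > y > 0.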


11.3.3 (b) F_Y^{-1}(u) = −log(1 − √u),
0 < u < 1.


11.3.7 (b) f(x|y) is a b(n, y) pmf;
f(y|x) is a beta(x + α, n − x + β)
pdf.


11.4.1 (b) 
11.4.2 (a) d(y) =
Jo [res *pY(1—p)”—¥ dp 
ho ea 
Jo [z=atuzp] PY A—p)"-¥ dp 
F-distribution, 213
distribution of 1/F, 217
mean, 214
relationship with t-distribution, 217
X-space, 645
Y-space, 645
σ-field, 12
mn-rule, 16
pth-quantile, 257
q−q-plot, 260
t-distribution, 211
asymptotic distribution, 330
mean, 212
mixture generalization, 221
relationship with F-distribution, 217
relationship with Cauchy distribution, 217
variance, 212

Abebe, A., 600
Absolutely continuous, 49
Accept–reject algorithm
generation
gamma, 299
normal, 299
Adaptive procedures, 620
Additive model, 531
Adjacent points, 259
Algorithm
accept–reject, 298
bisection, 249
EM, 405
Gibbs sampler, 675
Newton’s method, 372
Alternative hypothesis, 267, 469
Analysis of variance, 517
additive model, 531
interaction, 535
one-way, 517
two-way, 531
Ancillary statistic, 457
ANOVA, see Analysis of variance
two-way model
interaction, 535
Anti-ranks, 587
Arithmetic mean, 82
Arnold, S. F., 201
Assumptions
mle regularity condition (R5), 368
mle regularity conditions (R0)–(R2), 356
mle regularity conditions (R3)–(R4), 362
Asymptotic distribution
general scores
regression, 629
general scores estimator, 612
Hodges–Lehmann, 594
Mann–Whitney–Wilcoxon estimator for shift, 604
sample median, 583, 644
Asymptotic Power Lemma, 578
general scores, 611
regression, 629
Mann–Whitney–Wilcoxon, 603
sign test, 578
signed-rank Wilcoxon, 592
Asymptotic relative efficiency (ARE), 370
influence functions, 643
Mann–Whitney–Wilcoxon and t-test,
median and mean, 370
sign and t-test, 580
signed-rank Wilcoxon and t-test, 592
Wilcoxon and LS simple regression, 629
Asymptotic representation
influence function, 643
mle, 371
Asymptotically efficient, 370

Bandwidth, 233
Bar chart, 231
Barplot, 231
Basu’s theorem, 462
Bayes point estimator, 659
Bayes’ theorem, 26
Bayesian sequential procedure, 664
Bayesian statistics, 656
Bayesian tests, 663
Bernoulli distribution, 155
mean, 155 
variance, 155 
Bernoulli experiment, 155 
Bernoulli trials, 155 
Best critical region, 470 
Neyman-—Pearson Theorem, 472 
Beta distribution, 181 
generation, 303 
mean, 181 
relationship with binomial, 185 
variance, 181 
Big O notation, 335 
Bigler, E., 237 
Binomial coefficient, 17 
Binomial distribution, 156 
additive property, 159 
arcsin approximation, 346 
continuity correction, 345 
mean, 157 
mgf, 157
mixture generalization, 221 
normal approximation, 344 
Poisson approximation, 337 
relationship with beta, 185 
variance, 157 
Birthday problem, 16 
Bisection algorithm, 249 
Bivariate normal distribution, 198 
Bonferroni Procedure, 526 
Bonferroni procedure, 526 
Bonferroni’s inequality, 20 
Boole’s inequality, 19 
Bootstrap, 303 
hypotheses test 
for A = py — x, 309 
hypotheses testing 
for p, 311 
nonparametric, 445 
parametric, 445 
percentile confidence interval 
for 6, 305 
standard errors, 444 
standardized confidence interval, 313 
Borel $\sigma$-field, 23
Bounded in probability, 333 
implied by convergence in distribution, 333
Box, G. E. P., 296 
Boxplot, 259 
adjacent points, 259 
lower fence, 259 
potential outliers, 259 
upper fence, 259 
Bray, T. A., 297, 303 
Breakdown point, 644 
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sample mean, 644 
sample median, 645 
Breiman, L., 336 
Burr distribution, 222 
hazard function, 223 


Canty, A., 636 
Capture-recapture, 165 
Carmer, S. G., 528
Casella, G., 300, 386, 393, 452, 457, 676, 
678, 679, 681, 682 
Cauchy distribution, 60, 67, 73 
mgf does not exist, 73
relationship with t-distribution, 217 
cdf, see Cumulative distribution function (cdf)
n-variate, 134 
joint, 86 
Censoring, 56 
Central Limit Theorem, 342 
n-variate, 351 
normal approximation to binomial, 


statement of, 240 
Characteristic function, 74 
Chebyshev’s inequality, 79 
Chi-square distribution, 178
kth moment, 179
additive property, 180
mean, 178
normal approximation, 338
relationship with multivariate normal distribution, 202
relationship with normal, 192
variance, 178
Chi-square tests, 283 
Chung, K. L., 70, 322 
Combinations, 17 
Complement, 4 
Complete likelihood function, 405 
Complete sufficient statistic, 433 

exponential class, 435 
Completeness, 431 

Lehmann and Scheffé theorem, 432 
Composite hypothesis, 270 
Compounding, 220 
Concordant pairs, 632 
Conditional distribution 

n-variate, 13 

continuous, 110 

discrete, 110 
Conditional probability, 24 
Confidence coefficient, 239 
Confidence interval, 31, 238 


$\mu_1 - \mu_2$
t-interval, 243 
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large sample, 242 
$\sigma^2$, 247
$\sigma_1^2/\sigma_2^2$, 248
$p_1 - p_2$
large sample, 244 
based on Mann-Whitney-Wilcoxon,


5 

based on signed-rank Wilcoxon, 595 
binomial exact interval, 250 
bootstrap 

standardized, 313 
confidence coefficient, 239 
confidence level, 239 
discrete random variable, 249 
equivalence with hypotheses testing, 


large sample, mle, 371 


large sample, 240 
median, 584 
distribution-free, 262 
percentile bootstrap interval for $\theta$, 305
pivot, 239 
Poisson exact interval, 251 
proportion 
large sample, 241 
quantile $\xi_p$,
distribution-free, 261 
Confidence level, 239 
Conjugate family of distributions, 666 
Conover, W. J., 496
Consistent, 324 
Contaminated normal distribution, 194 
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implies convergence in distribution, 
33 


random vector, 349 
Slutsky’s Theorem, 333 


Convex function, 81 


strictly, 81 


Convolution, 108 
Correlation coefficient, 126 


sample, 552 


Countable, 3 


set, 3 


Countable intersection, 6 
Countable union, 6 
Counting rule, 16 


mn-rule, 16 
combinations, 17 
permutations, 16 


Covariance, 125 


linear combinations, 151 


Coverage, 317 
Craig, A. T., 564 
Credible interval, 662 


highest density region (HDR), 668 


Crimin, K., 600 

Critical region, 268, 469 

Cumulant generating function, 77 
Cumulative distribution function (cdf), 


39 
n-variate, 134 
bivariate, 86 
empirical cdf, 570 
joint, 86 
properties, 41 


CUSUMS, 506 


Contaminated point-mass distribution, 641
D’Agostino, R. B., 259


Contingency tables, 287 
Continuity correction, 345 
Continuity theorem of probability, 19 
Contour, 202 
Contours, 198 
Convergence 
bounded in probability, 333 
distribution, 327 
n-variate, 351 
same as limiting distribution, 327 
Central Limit Theorem, 342 
Delta ($\Delta$) method, 335
implied by convergence in probability, 332
implies bounded in probability, 333
mgf, 336
mgf
n-variate, 351 
probability, 322 
consistency, 324 


Data 


Zea mays, 267, 272, 589 
squeaky hip replacements, 228 
AZT doses, 282 

baseball, 243 


Bavarian sulfur dioxide concentrations, , 24


Boeing airplanes, 227 
Olympic race times, 633, 635 
Punt distance, 630 
punter.rda, 630 


R data 
aztdoses.rda, 282 


bb.rda, 236, 243, 554 
beta30.rda, 375 
braindata.rda, 237 
conductivity.rda, 537 
crimealk.rda, 291 
darwin.rda, 272 
earthmoon.rda, 401 
elasticmod.rda, 519 
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ex6111.rda, 360
ex763data.rda, 445
examp1053.rda, 615
exercise8316.rda, 499
fastcars.rda, 528
genexpd.rda, 402
lengthriver.rda, 585
lifetimemotor.rda, 235
mix668.rda, 411
normal50.rda, 395
olym1500mara.rda, 545, 633
punter.rda, 630
quailldl.rda, 521
reerl.rda, 548
scotteyehair.rda, 231, 288
sec951.rda, 539
sec95set2.rda, 539
sect76data.rda, 444
selfrival.rda, 281
shoshoni.rda, 574
speedlight.rda, 238
sulfurdio.rda, 229
telephone.rda, 548, 627
tempbygender.rda, 497
waterwheel.rda, 600
Salk polio vaccine, 244
self and rival times, 281
Shoshoni rectangles, 574
squeaky hip replacement, 228, 241
telephone, 548, 627
two-sample generated, 615
two-sample, variances, 500
water wheel, 600, 605
Davison, A. C., 304, 306
Decision function, 414
Decision rule, 268, 414
Degenerate distribution, 76
Delta ($\Delta$) method, 335, 346
n-variate, 353
arcsin approximation to binomial, 346
square-root transformation to Poisson, 348
theorem, 335
DeMorgan’s laws, 6
Density estimation, 233
Devore, J. L., 519
Dirichlet distribution, 182, 665
Discordant pairs, 632
Disjoint events, 5
Disjoint union, 5, 12
Dispersion of a distribution, 52
Distribution, 47, 259
F-distribution, 213
noncentral, 524
log F-family, 467
t-distribution, 211
Bernoulli, 155
beta, 181
binomial, 156
bivariate normal, 198
Burr, 222
Cauchy, 60, 67, 73
chi-square, 178
noncentral, 523
contaminated normal, 194
contaminated point-mass, 641
convergence, 327
degenerate, 76
Dirichlet, 182
Dirichlet, 665
distribution of kth order statistic, 255
double exponential, 106
extreme-valued, 301
geometric distribution, 160
Gompertz, 186
hypergeometric, 47, 162
joint distribution of (j, k)th order statistic, 256
Laplace, 77, 106, 260
loggamma, 219
logistic, 217, 262, 358
marginal, 90
marginal pdf, 91
mixture distribution, 218
multinomial distribution, 160
multivariate normal, 201
negative binomial, 678
negative binomial distribution, 159
noncentral t, 492
normal, 188
of a random variable, 37
order statistics, joint, 254
Pareto, 222
point-mass, 641
Poisson, 168
predictive, 666
Rayleigh, 186
shifted exponential, 327
skewed contaminated normal, 494
skewed normals, 197
standard normal, 187
Studentized range, 527
trinomial, 161
truncated normal, 195
uniform, 50
Waring, 224
Weibull, 185
Distribution free, 261
Distribution free test, 573
Distributions
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exponential, 176 
gamma, 174 
mixtures of continuous and discrete, 56
Distributive laws, 6 
Double exponential distribution, 106 
DuBois, C., 574 


Efficacy, 579 
general scores, 611 
regression, 629 
Mann-Whitney—Wilcoxon, 603 
sign test, 579 
signed-rank, 592 
Efficiency, 367 
asymptotic, 370 
confidence intervals, 239 
Efficiency of estimator, 367 
Efficient estimator, 366 
multiparameter, 389 
Efron, B., 303, 304, 307, 311 
EM Algorithm, 405 
Empirical Bayes, 679, 682 
Empirical cdf, 570 
simple linear model, 646 
Empirical rule, 191 
Empty set, 5 
Equal in distribution, 40 
Equilikely case, 15 
Estimate, 226 
Estimating equations (EE) 
based on normal scores, 614 
based on sign test, 582 
based on signed-rank Wilcoxon test, 


general scores, 612 
regression, 627 
linear model 


Wilcoxon, 646 


location 
I, 639 


based on LS, 639 
Mann-Whitney—Wilcoxon, 604 
maximum likelihood (mle), 227
mle, univariate, 357 
simple linear model 

LS, 542 

Estimation, 31 
Estimator, 226 
induced, 570 


maximum likelihood estimator (mle), 227
method of moments, 165 
point estimator, 226 
Euclidean norm, 348, 547 
Event, 2 
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Exhaustive, 12 

Expectation, 61 
n-variate, 135 
conditional, 111 


conditional distribution 
n-variate, 


conditional identity, 114 
continuous, 61 
discrete, 61 
function of a random variable, 62 
function of several variables, 93 
independence, 122 
linear combination, 151 
random matrix, 140 
random vector, 97 
Expected value, 61 
Experiment, 1 
Exponential class, 435 
Exponential distribution, 176 
memoryless property, 185 
Exponential family 
uniformly most powerful test, 484 
Exponential family of distributions 
multiparameter, 448 
random vector, 450 


Extreme-valued 
distribution, 301 


Factor space, 645 
Factorial moment, 76 
Fair game, 62 
Finite sample breakdown point, 644 
First Stage Analysis, 526 
Fisher information, 363 
Bernoulli distribution, 364 
beta($\theta$, 1) distribution, 367
location family, 364 
multiparameter, 388 
location and scale family, 390 
multinomial distribution, 391 
normal distribution, 389 
variance, normal distribution, 393 
Poisson distribution, 367 
Fisher’s PLSD, 528 
Fisher, D. M., 623 
Fitted value, 542 
LS, 542 
Five-number summary, 258 
boxplot of, 259 
Frequency, 2 
Function 
cdf, 39
n-variate, 134 
joint, 86 
characteristic function, 74 
convex, 81 
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cumulant generating function, 77 
decision, 414 
gamma, 173 
influence, 641 
likelihood, 355 
loss, 414 
marginal 

n-variate, 136 
marginal pdf, 91 
marginal pmf, 90 
mgf, 70
n-variate, 138
mgf several variables, 96
minimax decision, 415 
pdf, 50 

n-variate, 134 

joint, 87 
pmf, 46 

n-variate, 135 
power, 470 
probability function, 12 
quadratic form, 515 
risk, 414 
score, 364 
sensitivity curve, 639 
set function, 7 

Functional, 569, 640 

location, 570, 640 
scale, 572 
simple linear 

LS, 646 

Wilcoxon, 647 
symmetric error distribution, 571 


Game, 62 
fair, 62 
Gamma distribution, 174 
additive property, 177 
mean, 175 
mgf, 174
Monte Carlo generation, 294 
relationship with Poisson, 183 
variance, 175 
Gamma function, 173 
Stirling’s Formula, 331 
General rank scores, 608 
General rank scores test statistic, 608 


General scores test statistic
linear model, 


Gentle, J. E., 300, 303 

Geometric distribution, 160 
memoryless property, 166 

Geometric mean, 82, 439 

Geometric series, 8 

George, E. I., 676, 678 

Gibbs sampler, 675 
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Gini’s mean difference, 265 
Gompertz distribution, 186 
Goodness-of-fit test, 285 
Grand mean, 517 

Graybill, F. A., 391 


Haas, J. V., 600 
Haldane, J. B. S., 666
Hampel, F. R., 642 
Hardy, G. H., 334 
Harmonic mean, 82 
Hazard function, 175 
Burr distribution, 223 
exponential, 186 
linear, 186 
Pareto distribution, 223 
Hettmansperger, T. P., 248, 382, 383,
467, 496, 548, 569, 599, 607,
612, 624, 625, 627, 633, 643,
648, 649
Hewitt, E., 81 
Hierarchical Bayes, 679 
Highest density region (HDR), 668 
Hinges, 258 
Hinkley, D. V., 304, 306 
Histogram, 230 
Hodges, J. L., 594, 614 
Hodges—Lehmann estimator, 594 
Hogg, R. V., 564, 623 
Hollander, M., 635, 636 
Hsu, J. C., 529 
Huber, P. J., 365, 643 
Hypergeometric distribution, 47, 162 
Hyperparameter, 679 
Hypotheses testing, 267 
alternative hypothesis, 267 
Bayesian, 663 
binomial proportion p, 269 
power function, 269 
bootstrap 
for $\mu$, 311
bootstrap test
for $\Delta = \mu_y - \mu_x$, 309
chi-square tests, 283 
for independence, 288 
goodness-of-fit test, 285 
homogeneity, 287 
composite hypothesis, 270 
critical region, 268 
decision rule, 268 
distribution free, 573 
equivalence with confidence intervals, 


for 1 — 
t-test, 578 
general rank scores, 608 
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general scores 
regression, 626 
likelihood ratio test, see Likelihood ratio test
Mann-Whitney—Wilcoxon test, 599 


mean 
t-test, 272 
large sample, 271 
large sample, power function, 271 
two-sided, large sample, 276 
median, 573 
Neyman-Pearson Theorem, 472
null hypothesis, 267 
observed significance level (p-value), 279
one-sided hypotheses, 275
permutation tests, 310
power, 269
power function, 269
randomized tests, 279
sequential probability ratio test, 502
signed-rank Wilcoxon, 587
significance level, 271
simple hypothesis, 270
size of test, 268
test, 268
two-sided hypotheses, 275
Type I error, 268
Type II error, 268
uniformly most powerful critical region, 479
uniformly most powerful test, 479


Idempotent, 559 
Identity 

conditional expectation, 114 
iid, 140, 152 
Improper prior distributions, 667 
Inclusion exclusion formula, 20 
Independence 

n-variate, 137 

expectation, 122 

mgf, 122

random variables 

bivariate, 118 

Independent, 28 

events, 28 

mutually, 29 
Independent and identically distributed, 140
Induced estimator, 570 
Inequality 
Bonferroni’s inequality, 20 
Boole’s inequality, 19 
Chebyshev’s, 79 
conditional variance, 114 
correlation coefficient, 133 
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Jensen’s, 81 
Markov’s, 79 
Rao—Cramér lower bound, 365 
Infimum, 688 
Influence function, 641 
Hodges—Lehmann estimate, 643 
sample mean, 642 
sample median, 643 
simple linear 
LS, 648 
Wilcoxon, 648 
Instrumental pdf, 298 
Interaction parameters, 535 
Interquartile range, 52
Intersection, 5 
countable intersection, 6 


Jacobian, 55 
n-variate, 144 
bivariate, 102 
Jeffreys’ priors, 671 
Jeffreys, H., 671 
Jensen’s inequality, 81 
Johnson, M. E., 496 
Johnson, M. M., 496 
Joint sufficient statistics, 447 
factorization theorem, 447 
Jointly complete and sufficient statistics, 


Jones, M. C., 233 


Kendall’s $\tau$, 632
estimator, 632 
null properties, 633 
Kennedy, W. J., 300, 303 
Kernel 
rectangular, 233 


Kitchens, L. J., 528

Kloke, J. D., 244, 291, 521, 569, 615, 623, 
624, 627, 636, 649 

Krishnan, T., 404, 409 

Kurtosis, 76 


Laplace distribution, 77, 106, 260 

Law of total probability, 26 

Least squares (LS), 541 

Lehmann and Scheffé theorem, 432 

Lehmann, E. L., 233, 277, 334, 383, 386, 
389, 393, 398, 423, 452, 457, 
487, 488, 594, 614, 676, 679, 
681, 682 

Leroy, A. M., 627 

Likelihood function, 227, 355 

Likelihood principle, 417 

Likelihood ratio test, 377 

asymptotic distribution, 379 
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beta($\theta$, 1) distribution, 381
exponential distribution, 377 
for independence, 553 
Laplace distribution, 381 
multiparameter, 396 
asymptotic distribution, 398 
multinomial distribution, 398 
normal distribution, 396 
two-sample normal distribution, 401 
variance of normal distribution, 401 
normal distribution, mean, 378 
relationship to Wald test, 380 
two-sample 
normal, means, 488 
normal, variances, 495, 496 
Limit infimum (liminf), 331, 689 
Limit supremum (limsup), 331, 689 
Linear combinations, 151 
Linear discriminant function, 512 
Linear model, 540, 645 
matrix formulation, 547 
simple, 625 
Little o notation, 335 
Local alternatives, 577, 602 
Location and scale distributions, 259 
Location and scale invariant statistics, 


Location family, 364 
Location functional, 570 
Location model, 242, 571, 572 
t-distribution, 217 
normal, 191 
shift (A), 598 
Location parameter, 191 
Location-invariant statistic, 458 
Loggamma distribution, 219 
Logistic distribution, 217, 262, 358 
Loss function, 414 
absolute-error, 416 
goalpost, 416 
squared-error loss, 416 
Lower control limit, 505 
Lower fence, 259 


Main effect hypotheses, 532 
Mann-Whitney—Wilcoxon statistic, 599 
Mann-Whitney—Wilcoxon test, 599 

null properties, 600 

Marginal distribution, 90 

continuous, 91 

Markov chain, 676 

Markov Chain Monte Carlo (MCMC), 680
Markov’s inequality, 79 
Marsaglia, G., 297, 303 
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Maximum likelihood estimator (mle), 226, 
227, 357 
multiparameter, 387 
asymptotic normality, 369 
asymptotic representation, 371 
binomial distribution, 228 
consistency, 359 
exponential distribution, 227 
logistic distribution, 357 
multiparameter 
$N(\mu, \sigma^2)$ distribution, 387
Laplace distribution, 387 
multinomial distribution, 391 
Pareto distribution, 394 
normal distribution, 228 
of $g(\theta)$, 358
one-step, 373 
relationship to sufficient statistic, 427 
uniform distribution, 230 
McKean, J. W., 244, 248, 291, 382, 383,
467, 496, 521, 528, 548, 569,
599, 600, 607, 612, 615, 620,
623-625, 627, 636, 638, 643, 648,
649, 653
McLachlan, G. J., 404, 409 
Mean, 61, 68 


n-variate, 141 
arithmetic mean, 82 
conditional, 111 
linear identity, 128 
geometric mean, 82 
grand, 517 
harmonic mean, 82 
sample mean, 152 
Mean profile plots, 531 
Median, 51, 76, 572 
breakdown point, 645 
confidence interval 
distribution-free, 262 
of a random variable, 51 
sample median, 257 
Method of moments estimator, 165 
mgf, see Moment generating function
Midrange 
sample midrange, 257 
Miller, R. G., 529 
Minimal sufficient statistics, 455 
Minimax criterion, 415 
Minimax principle, 415 
Minimax test, 509 
Minimum chi-square estimates, 286 
Minimum mean-squared-error estimator, 
4 


Minimum variance unbiased estimator, see MVUE
Minitab command 
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rregr, 627 
Mixture distribution, 218, 408 
mean, 218 
variance, 219 


Mixtures of continuous and discrete distributions, 5


mle, see Maximum likelihood estimator (mle)
Mode, 58 
Model 
linear, 540, 645 
location, 191, 242, 571 
median, 572 
normal location, 191 
simple linear, 625 
Moment, 72 
mth, 72 
about $\mu$, 76
factorial moment, 76 
kurtosis, 76 
skewness, 76 
Moment generating function (mgf), 70 
n-variate, 138 
binomial distribution, 157 
Cauchy distribution (mgf does not exist), 73
convergence, 336 
independence, 122 
multivariate normal, 201 
normal, 188 
Poisson distribution, 169 
quadratic form, 557 
several variables, 96 
standard normal, 187 
transformation technique, 107 
Monotone likelihood ratio, 483 
relationship to uniformly most powerful test, 483
regular exponential family, 484 
Monotone sets, 7 
nondecreasing, 7 
nonincreasing, 7 
Monte Carlo, 292, 595, 672 
generation 
beta, 303 
gamma, 294, 299 
normal, 296 
normal via Cauchy, 299 
Poisson, 295 
integration, 295 
sequential generation, 674 
situation, 595 
Monty Hall problem, 36
Mood’s median test, 616, 618 
Mosteller, F., 258 
Muller, M., 296
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multinomial distribution, 160 
Multiple Comparison 
Bonferroni, 526 
Tukey-Kramer, 528 
Multiple comparison 
Tukey, 527 
Multiple Comparison Problem, 526 
Bonferroni procedure, 526 
Multiple Comparison Procedure 
Fisher, 528 
Tukey, 527 
Multiplication rule, 16, 25 
mn-rule, 16 
for probabilities, 25 
Multivariate normal distribution, 201 
conditional distribution, 204 
marginal distributions, 203 
mgf, 201
relationship with chi-square distribution, 202
Mutually exclusive, 12 
Mutually independent events, 29 
MVUE, 413 
$\mu$, 454
binomial distribution, 440 
exponential class of distributions, 438 
exponential distribution, 428 
Lehmann and Scheffé theorem, 432 
multinomial, 450 
multivariate normal, 451 
Poisson distribution, 438 
shifted exponential distribution, 434 


Naranjo, J. D., 620 
Negative binomial distribution, 159, 678 
as a mixture, 220 
mgf, 159
Newton’s method, 372 
Neyman’s factorization theorem, 422 
Neyman-Pearson Theorem, 472
Noncentral F-distribution, 524 
Noncentral t-distribution, 492 
Noncentral chi-square distribution, 523 
Noninformative prior distributions, 667 
Nonparametric, 230 
Nonparametric estimate of pmf, 230 
Nonparametric estimators, 570 
Norm, 348 

Euclidean, 348 

pseudo-norm, 651 
Normal distribution, 188 
approximation to chi-square distribution, 338

distribution of sample mean, 193 
empirical rule, 191 
mean, 188 
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mgf, 188
points of inflection, 189 
relationship with chi-square, 192 
variance, 188 

Normal equations, 645 

Normal scores, 614 

Null hypothesis, 267, 469 


Observed likelihood function, 405 
Observed significance level, 280 
One-sided hypotheses, 275 
One-step mle estimator, 373 
One-way ANOVA, 517 
First Stage, 526 
Multiple Comparison Problem, 526 
Second Stage, 526 
Optimal score function, 613 
Order statistics, 254 
ith-order statistic, 254 
distribution of kth order statistic, 255
joint distribution of (j,k)th, 256 
joint pdf, 254 
Ordinal, 231 
Outlier, 216 


p-value, 280 
Parameter, 156, 191, 225 
location, 191 
scale, 191 
shape, 191 
Pareto distribution, 222 
hazard function, 223 
Partition, 12 
Pearson residuals, 289 
Percentile, 51, see quantile 
Permutation, 16 
Permutation tests, 310 
Plot 
q-q plot, 260
boxplot, 259 
mean profile plots, 531 
scatterplot, 540 
pnbinom, 159 
Point estimator, 226, see Estimator 
fa — Hla, 241 
$p_1 - p_2$, 244
asymptotically efficient, 370 
Bayes, 659 
consistent, 324 
efficiency, 367 
efficient, 366 
five-number summary, 258 
median, 257, 572 
midrange, 257 
MVUE, see MVUE 
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pooled estimator of variance, 242 
quantile, 258 
quartiles, 258 
range, 257 
robust, 642 
sample mean, 152 
unbiased, 226 
Point-mass distribution, 641 
Poisson distribution, 168 
additive property, 171 
approximation to binomial distribu- 
tion, 337 
compound or mixture, 220 
limiting distribution, 340 
mean, 170 
mgf, 169
Monte Carlo generation, 295 
relationship with gamma, 183 
square-root transformation, 348 
variance, 170 
Poisson process 
axioms, 168 
Pooled estimator of variance, 242 
Positive definite, 201 
Positive semi-definite, 142, 200 
Posterior, 27 
distribution, 656 
relation to sufficiency, 658 
probabilities, 27 
Potential outliers, 259 
Power function, 269, 470 
Power of test, 269 
Precision, 668 
Predicted value, 542 
LS, 542 
Prediction interval, 245 
Predictive distribution, 666 
Predictor, 625 
Principal components, 206 
nth, 207 
first, 206 
Prior, 27, 655 
distributions, 656 
conjugate family, 666 
improper, 667 
noninformative, 667 
proper, 667 
Jeffreys’ class, 671 
probabilities, 27, 655 
Probability 
bounded, 333 
conditional, 24 
convergence, 322 
equilikely case, 15 
Probability density function (pdf), 50 
n-variate, 134 
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conditional, 110 
joint, 87 
marginal, 91 
n-variate, 136 
Probability function, 12 
Probability interval, 662 
Probability mass function (pmf), 46 
n-variate, 135 
conditional, 110 
joint, 86 
marginal, 90 
Process, 574 
general scores, 609 
regression, 626 
Mann-Whitney—Wilcoxon, 601 
sign, 574 
signed-rank, 590 
Proper prior distributions, 667 
Pseudo-norm, 651 


Quadrant count statistic, 637 
Quadratic form, 515 
matrix formulation, 556 
Quantile, 51 
absolutely continuous case, 52 
confidence interval 
distribution-free, 261 
sample quantile, 258 
Quartile, 51 
Quartiles 
interquartile range, 52 
of a random variable, 51 
sample quartiles, 258 
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abgame, 31 
aresimcn, 596 
barplot, 231 
bday, 17 
binomci, 250 
binpower, 270 
bootse1, 444
bootse2, 445 
boottestonemean, 311 
boottesttwo, 310 
boxplot, 259 
cdistplt, 338 
chisq.test, 285
cipi, 549 
condsiml, 675 
consistmean, 324 
cor, 633 
cor.boot, 636 
cor.boot.ci, 636 
cor.test, 554, 633, 635 
density, 233 




dgeom, 160 
dhyper, 162 
eigen, 557 
empalphacn, 298 
fivenum, 258 
getcis, 246 
gibbser2, 677 
hierarch1l, 682 
hist, 232 
hogg.test, 623 
interaction.plot, 537 
lm, 537, 627
mepbon, 527 
mean, 228 
mlelogistic, 373 
mses, 596 
multitrial, 165 
onesampsgn, 262, 584 
oneway.test, 519 
p2pair, 400 
pbeta, 181 
pbinom, 157 
pchisq, 178 
percentciboot, 305 
pf, 214 
pgamma, 175 
piest, 293 
piest2, 296 
pnbinom, 159 
pnorm, 189 
poisrand, 295 
poissonci, 251 
ppois, 170 
prop.test, 228, 241 
pt, 211 

ptukey, 527 
qqnorm, 261 
qqplotc4s2, 261 
quantile, 258 
rcauchy, 246 
ren, 596 

rexp, 402 

rfit, 615, 627 
rscn, 494 

seq, 196 
simplegame, 65 
t.test, 240, 277 
tpowerg, 492 
var, 228 
wil2powsim, 605 
wilcox.test, 589, 601 
ww, 627 

zpower, 277 
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likelihood function, 227 
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sample size, 226 
statistic, 226 
Random variable, 37 
continuous, 37, 49 
discrete, 37, 45 
equal in distribution, 40 
vector, 85 
Random vector, 85 
n-variate, 134 
continuous, 87 
discrete, 86 
Random-walk procedure, 505 
Randomized tests, 279 
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sample range, 257 
Rank-based procedures, 569 
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Rao—Blackwell theorem, 427 
Rao—Cramér lower bound, 365, 613 
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Rasmussen, S., 630 
Rayleigh distribution, 186 
Relative frequency, 2 
Residual, 542 
LS, 542 
Residual plot, 544 


Residuals 
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Robust estimator, 642 
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Sample mean, 152 
consistency, 322 
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distribution under normality, 214 
variance, 152 

Sample median, 257 

Sample midrange, 257 

Sample proportion 
consistency, 326 

Sample quantile, 258 
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Sample quartiles, 258 

Sample range, 257 
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Sample size, 226 
Sample size determination, 580 
t-test, 580 
general scores, 617 
Mann-Whitney—Wilcoxon, 603 
sign test, 580 
two-sample t, 603 
Sample space, 1 
Sample variance 
consistent, 325 
distribution under normality, 214 
Sandwich theorem, 688 
Scale functional, 572 
Scale parameter, 191 
dispersion, 52 
spread, 52 
Scale-invariant statistic, 458 
Scatter plot, 540 
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Score function, 364, 608 
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two-sample sign, 616 
Scores test, 380 
beta($\theta$, 1) distribution, 381
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Sensitivity curve, 639 
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Set function, 7 
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Shift, in location, 598 
Shifted exponential distribution, 327 
Shrinkage estimate, 666 
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Sign statistic, 573 
Sign test, 573 
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Signed-rank Wilcoxon, 586 
Walsh average identity, 589 
Signed-rank Wilcoxon test, 587 
null properties, 588 
Significance level, 271 
Simple hypothesis, 270 
Simulation, 31 
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Size of test, 268, 470 
Skewed contaminated normal distribution, 494
Skewed distribution, 51 
Skewed normal distributions, 197 
Skewness, 76 
Slutsky’s Theorem, 333 
Spearman’s rho, 634 
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I(2,0) distribution, 421 
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Bayes’ theorem, 26 
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n-variate, 351 
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two-sided alternative, 488 
Unbiasedness, 226 
Uniform distribution, 50 
Uniformly most powerful critical region, 
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Upper fence, 259 


Variance, 68 
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n-variate, 350 
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